By Brian Kimmel — Sep 4, 2025

Vibe Coding an App With GPT-5, and Then Trying to Hack It

With agentic AI coding tool adoption skyrocketing, are LLMs dumping millions of lines of insecure code onto an unsuspecting public?

With LLM vibe-coding tool adoption skyrocketing, I’ve been thinking a lot lately about the software that we’re releasing into the world. Security is always an afterthought, but agentic coding is generating an unprecedented volume of new production code. Are we dumping millions of lines of insecure software onto an unsuspecting public, hoping it won’t be the subject of the next data breach or high-profile hack?

I decided to find out more about the risk of cybersecurity holes in vibe-coded apps by building an app and then trying to hack it.

The App - A Todo List

In order to fully test the default security that a non-programmer could expect to get from vibe coding an app, I chose the simplest use case that has both a frontend and a backend, with user login functionality: a todo app.

Since I wanted to simulate a non-programmer vibe coding, I added one constraint to the experiment: I never mentioned security to the LLM. I said that I wanted “user login functionality” and a backend but didn’t say anything about security requirements. The assumption is that a non-engineer wouldn’t even know to ask for security; they’d be focused on the product functionality.

I started off by writing a very general prompt and asking ChatGPT to generate a requirements doc to be fed to GitHub Copilot. I’ve had success with Copilot and Cursor when breaking an idea down into discrete tasks, but ChatGPT can do all the legwork of writing up the details for me. Here’s my initial prompt to generate the doc:

Taking this description of an application, generate a set of prompts for a coding agent to implement iteratively step-by-step:

Create a todo list app with a good UI design, with the following requirements:

1. app should have a backend + fronted + database + login, running in docker containers using micro services.
2. app should be a simple todo list, clean design
3. feature: recurring items
4. feature: ordering of tasks
5. feature: check-off tasks

I spent some time vibe coding, and GPT-5 seemed like it wanted to focus on the backend services quite a bit. After an hour or so it had created Docker containers for the backend API, database, and a gateway, as well as a barebones front end that didn’t do much beyond allowing the user to log in.

Despite mentioning a good UI design in my initial prompt, GPT-5 didn't do a great job of giving the app any fit and finish. I've found that for UI/UX, you have to be a little more specific than "make it look good." My favorite trick for that is to choose a random aesthetic from the Aesthetics Wiki. I scrolled around a bit until I found something light and fun: Bubblegum Witch.

The vibe-coded todo list with a Bubblegum Witch aesthetic overhaul

With the new UI overhaul, I figured that I was at the point where a non-engineer vibe-coder would probably say that their MVP was done and publish. I didn't want to tweak too much because we're not building an app, we're kicking the tires on security, so I moved to the next phase.

Is Bubblegum Witch Todos Secure?

The short answer is, surprisingly, yes. Here's what I tried and what the outcomes were.

Manual Testing

I have to admit at this point that I am biased. I strongly feel that these LLM coding tools are here to stay, but that the corporate dream of replacing engineers with agents is irrational exuberance. One area in particular that I am concerned about, and why I wanted to do this experiment, is application security. I was expecting to find several major security flaws.

So I was excited when I had a polished-looking application and wanted to try hacking it myself first. I'm not a hacker, but I am familiar with the OWASP Top 10.

The first thing I tried was a Cross-Site Scripting (XSS) attack. I entered a task named <script>alert();</script> , which the application correctly displayed verbatim but did not execute. Next, I used Chrome dev tools to examine the HTML surrounding the element where task names were displayed, and tried to break out of the div with this input: <script>alert();</script> </div><script>alert("hi")</script> <div class="title">

The app still displayed the input verbatim without executing the JavaScript code.
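For illustration, here is a minimal Python sketch of the output escaping that produces this "displayed verbatim" behavior. The generated app's actual stack and rendering code aren't shown in this post, so `render_task_title` is a hypothetical helper, not the app's implementation.

```python
import html


def render_task_title(title: str) -> str:
    """Return an HTML fragment with the user-supplied title escaped.

    Hypothetical helper for illustration; not taken from the generated app.
    """
    # html.escape turns <, >, & and quote characters into HTML entities,
    # so the browser renders the text instead of parsing it as markup.
    return f'<div class="title">{html.escape(title)}</div>'


# The break-out attempt from the manual test above.
payload = '<script>alert();</script> </div><script>alert("hi")</script> <div class="title">'
print(render_task_title(payload))
```

Because the angle brackets never reach the browser as markup, the closing `</div>` in the payload cannot end the surrounding element. Most modern frontend frameworks apply this kind of escaping by default when rendering text.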

Next, I tried a SQL injection attack by entering ' OR 1=1;-- to try to break any non-parameterized SQL and run arbitrary SQL, but again, the application didn't execute it.
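To show why that input is harmless against parameterized SQL, here is a small sketch using Python's built-in `sqlite3` module. The table layout and function names are assumptions made up for the example; the app's real database and queries aren't shown in this post.

```python
import sqlite3

# Throwaway in-memory database with a hypothetical tasks table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, owner TEXT, title TEXT)")
conn.executemany(
    "INSERT INTO tasks (owner, title) VALUES (?, ?)",
    [("alice", "buy milk"), ("bob", "secret plan")],
)


def find_tasks_unsafe(owner: str, title: str):
    # Vulnerable: user input is pasted straight into the SQL text.
    query = (
        "SELECT owner, title FROM tasks "
        f"WHERE owner = '{owner}' AND title = '{title}'"
    )
    return conn.execute(query).fetchall()


def find_tasks_safe(owner: str, title: str):
    # Parameterized: the driver passes the values separately from the SQL,
    # so the payload is compared as a plain string.
    query = "SELECT owner, title FROM tasks WHERE owner = ? AND title = ?"
    return conn.execute(query, (owner, title)).fetchall()


payload = "' OR 1=1;--"
print(find_tasks_unsafe("alice", payload))  # every row, including bob's
print(find_tasks_safe("alice", payload))    # no rows
```

In the unsafe version the payload closes the string literal and appends a condition that is always true, so the owner filter stops mattering. In the safe version there is simply no task with that odd title.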

The last thing that I manually tried was poking at the auth API to see if I could gain API access without login credentials. I booted up Postman and tried to access the API URLs without an auth token, but wasn't able to get past the security.
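As a rough sketch of the check that turns those requests away, the Python below rejects any request that lacks a valid bearer token. The in-memory `SESSIONS` store and both function names are hypothetical; the post doesn't show which token scheme the generated app uses.

```python
import hmac
import secrets

# Hypothetical in-memory session store: token -> user id.
SESSIONS = {}


def issue_token(user_id: str) -> str:
    """Create an unguessable token for a logged-in user."""
    token = secrets.token_urlsafe(32)
    SESSIONS[token] = user_id
    return token


def authenticate(headers: dict):
    """Return (status_code, user_id) for a request's headers."""
    scheme, _, presented = headers.get("Authorization", "").partition(" ")
    if scheme != "Bearer" or not presented:
        return 401, None  # no credentials at all
    for token, user_id in SESSIONS.items():
        # compare_digest avoids leaking how many characters matched.
        if hmac.compare_digest(token.encode(), presented.encode()):
            return 200, user_id
    return 401, None  # unknown or forged token


token = issue_token("alice")
print(authenticate({}))                                    # (401, None)
print(authenticate({"Authorization": "Bearer made-up"}))   # (401, None)
print(authenticate({"Authorization": f"Bearer {token}"}))  # (200, 'alice')
```

The point of the manual test is the first two cases: every API route should fail closed when the header is missing or wrong.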

The Real Test: SAST Analysis

I know a few tricks, but I am not an expert hacker. Smarter people than me have developed code scanners that detect security flaws in application code, called SAST tools. Static Application Security Testing (SAST) tools work similarly to linters, but for security.

The first tool that I tried was SonarQube, which was astonishingly easy to set up and use. There is a Docker container for the web UI, which you can run using the command:

docker run -d --name sonarqube -e SONAR_ES_BOOTSTRAP_CHECKS_DISABLE=true -p 9000:9000 sonarqube:latest

Once that's up and running, you can connect to the UI and it will give you all the steps to scan an application locally. Here are the results of running SonarQube against the generated code:

It had 0 Critical or Major findings, and a few Medium findings related to docker containers running with elevated privileges. SonarQube did not find any vulnerabilities in the application itself.

The next tool I tried was Semgrep, a lighter-weight security scanning tool that also scans for secrets stored in the codebase. Semgrep is even easier to install and run than SonarQube. On a Mac, just run brew install semgrep from the terminal, and then run

semgrep scan --config auto

Here's the output for the Bubblegum Witch Todo app:

All 5 of the findings were that same elevated privilege issue that SonarQube found. I don't consider those flaws in the application, but it's worth noting that you should pay special attention to your infrastructure when deploying a vibe-coded app.

Conclusion

I have to admit, I was very surprised that I was not able to easily hack this application. I thought that there would be some obvious security flaw that I'd be able to exploit immediately, but according to the scans and my amateur hacking results, the code is reasonably secure.

But the actionable steps are the same regardless of the outcome of my testing. The downside of deploying LLM-generated code with invisible security flaws is huge, so make sure to review and understand the way your app works before publishing it.

In addition to a manual code review, I think SAST tools are essential to getting a read on your application's security risks. SonarQube and Semgrep are so easy to use that you should be including one or both in your CI/CD pipeline as a matter of course. Also keep an eye out for new GitLab and GitHub CI/CD security features. Both are working on AI-assisted scanners that could represent the next generation of vulnerability scanning.
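One way to wire a scanner into a pipeline is a small gate script that reads the scanner's JSON report and fails the build on serious findings. The Python sketch below assumes Semgrep's `--json` output, where findings sit in a top-level `results` list with a severity under `extra`. The function names, the choice of `ERROR` as the blocking level, and the sample report are all made up for illustration.

```python
import json
import subprocess


def run_semgrep(target: str = ".") -> dict:
    """Run Semgrep and return its parsed JSON report (needs semgrep on PATH)."""
    proc = subprocess.run(
        ["semgrep", "scan", "--config", "auto", "--json", target],
        capture_output=True, text=True, check=False,
    )
    return json.loads(proc.stdout)


def blocking_findings(report: dict, blocking=("ERROR",)) -> list:
    """Keep only findings whose severity is in the blocking set (an assumption)."""
    return [
        finding for finding in report.get("results", [])
        if finding.get("extra", {}).get("severity") in blocking
    ]


def describe(finding: dict) -> str:
    """Format one finding as path:line rule-id."""
    line = finding.get("start", {}).get("line", "?")
    return f'{finding.get("path")}:{line} {finding.get("check_id")}'


# Hand-written sample in the shape of a report, NOT real scan output.
sample_report = {"results": [
    {"check_id": "example.rule-a", "path": "docker-compose.yml",
     "start": {"line": 12}, "extra": {"severity": "WARNING"}},
    {"check_id": "example.rule-b", "path": "api/app.py",
     "start": {"line": 40}, "extra": {"severity": "ERROR"}},
]}
for finding in blocking_findings(sample_report):
    print(describe(finding))
```

In a real pipeline step you would call `blocking_findings(run_semgrep())` and exit with a non-zero status when the list is not empty, so that the build stops before deployment.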

Further Testing

I know that my testing was informal and non-exhaustive. If you're interested, feel free to clone the Bubblegum Witch Todos repo and run your own tests. Be sure to let me know if you're able to bypass security. As a software engineering community, we have a collective responsibility to analyze, evaluate, and report on security problems with these LLMs as they become more popular.

A subject that I didn't cover in the testing, but that is no less important, is data privacy. Vibe-coded apps that ignore industry-specific privacy laws like HIPAA and locale-specific laws like the GDPR can be just as damaging as ones with major security flaws.