An Honest Experiment in Vibe Coding: Written Completely by a Human
Ever since Andrej Karpathy coined the term in a tweet, the concept of “vibe coding” has captured imaginations and stoked anxieties across many industries and disciplines. However, it has always been difficult to discern how much of the purported capability (and potential disruption) of vibe coding is real and how much is mere hype and wishful thinking. As public discourse becomes increasingly muddied by analyses and predictions from experts and amateurs alike, it’s easy to lose sight of the actual capabilities, and limitations, of vibe coding tools.
If you’re not familiar with vibe coding, the idea is simple: tell the AI “vibe coding” tool what you want it to build, and let it worry about the code. That’s it. Full stop. You don’t look at the code, nor worry about it at all. The process is iterative and has very few steps:
- Tell the tool what you want it to build.
- Review what it has built.
- If there are bugs, tell it what they are so it can fix them.
- If it looks fine, return to step one with any changes, new features, etc.
This simplicity, and the tantalizing possibility that one could author software without any knowledge of software development, is likely what makes it so appealing. AI-assisted vibe coding isn’t the first tool to claim such capabilities; the history of writing software is full of visual programming languages, no-code platforms, and WYSIWYG tools that allow users to build without ever writing a single line of code. Many of these tools have enjoyed substantial success: WYSIWYG website builders like Squarespace and Wix are often the first choice for small businesses, restaurants, and podcasts needing a simple marketing site without much functionality.
However, these tools generally trade away some degree of flexibility and functionality for their ease of use. They work well when the user’s needs are simple and fit within the tool’s limitations. But as soon as a user needs to deviate from the prescribed path (implementing an unsupported feature, moving outside the platform’s visual language, or integrating into a third-party service) a roadblock usually appears. Working around it, if possible at all, often requires hacky solutions or gaining access to some “escape hatch” that allows direct code editing. These tools allow rapid building with little technical expertise, but they aren’t built for “coloring outside the lines.”
The potential promise of vibe coding is that there needn’t be any such guardrails. LLMs, tuned with software engineering knowledge, could in theory allow unfettered app building without the operator needing any of the expertise usually required for such a task. Because of LLMs’ broad knowledge base, the usual trade-off of flexibility for ease of use might simply not apply.
The question is: can these tools actually deliver on the hype? Is it possible for vibe coding tools to act as your own personal robot software developer? Is the quality of their output sufficient for a production-level application?
That’s what I intended to assess in this article: picking one of these tools, using it to generate a simple application, and then evaluating the quality of the output.
Notes on Procedure
Tool selection
The tool we selected for this exercise is Replit. Why Replit? There were a number of reasons. The less compelling reasons were mundane, practical considerations: Replit was a known quantity, as we already had some institutional experience with it, and it offered a flexible monthly pricing option.
More importantly, in a landscape full of AI-enabled coding tools, Replit is very specifically and deliberately marketing itself as a “vibe coding” platform. The Replit homepage has a section titled “the safest place for vibe coding,” and they have a blog post dedicated to explaining vibe coding with Replit, along with several other posts, instructional videos, and other supporting material. Replit was also one of the first vibe coding platforms to implement a one-button deployment feature.
Additionally, Replit was mentioned in several community platforms, like Reddit, as a viable option. With the list of vibe coding platforms being sizable and dynamic, there was enough material promoting Replit that it seemed, at least at the time of writing, a reasonable choice.
Goal of this Review
My goal is to give an honest review of this vibe coding tool and to evaluate how best to use vibe coding platforms to provide added value. If we can leverage the tool to better serve our customers, that’s valuable information. But I can only make this assessment if I am objective in my approach.
I aim to be as transparent as possible. I will share the procedure I used in the experiment, the materials I used to prompt the tool, and the outcomes. That will enable an informal type of “peer review” for those who wish to verify the approach and attempt to reproduce my findings.
The Procedure
My goal was to use vibe coding to produce a space-themed to-do app with a set of rudimentary requirements. I would provide an initial, short prompt of just a few sentences, then attach a document with some starter features and requirements (a feature supported by the platform). After the initial output was rendered, I would work with the chat agent to iterate on the app, making adjustments and adding features. In keeping with the “vibe coding” mindset as defined by Karpathy, I wouldn’t consider the code itself during this process. Instead, I’d provide prompts and feedback to the AI agent based solely on my testing of the UI. No code; just vibes.
I followed this procedure for two separate passes with two separate “personas.”
- Pass #1: The identity of a non-technical user, someone who might use vibe coding to create an application that would otherwise be impossible to build given their current skillset.
- Pass #2: The identity of a more technically experienced user, someone making specific requests about the tech stack, technical requirements, and best practices.
Once the app reached a satisfactory state, I would stop and review the code. In both passes, I also left some time at the end for open-ended experimentation.
The platform offered two main modes:
- Agent mode: The main iterative interface for vibe coding.
- Assistant mode: A tool for smaller, fine-grained changes intended for users with working knowledge of code. This was not used in Pass #1 but was later tested in Pass #2.
Pass #1 — “Astro Todo” created by a non-technical user
“I want to make an outer-space-themed to-do app called ‘Astro Todo.’ The requirements are attached.”

For the first pass, acting as a non-technical user, I provided the above prompt along with a Word document. The attachment included guidance on the visual theme, the required pages, and the intended functionality.
The app’s page structure was very simple:
Public-facing pages:
- Login
- Register
- Info
Logged-in pages:
- To-do Lists
- List details (showing the actual to-do items for a given list)
The instructions were not exhaustive or prescriptive, by design. I wanted the tool to “fill in the gaps” where reasonable. No visual design was provided; instead, I relied on rough descriptions of page layouts and backgrounds.
I also gave no direction about the technical stack, coding style, or guidelines. For this persona, the assumption was that the user wouldn’t have that level of technical knowledge, nor would they care. They’d simply care about the outcome.
First Iterations
After consuming the initial prompt, the tool took a few minutes and produced its first draft. The login page looked decent: the layout was clean and contained all the expected elements. However, the agent ignored my specific request for a planet background, choosing instead to display an astronaut.
When I attempted to log in, I hit my first snag. The tool explained this was expected, as the app was using Replit as an OAuth provider. I threw it an intentional wrench, asking for an external provider to verify user emails and for sessions to be handled internally.
This request took a while, with moments where the tool seemed to go on tangents without my input. Ultimately it implemented the feature, but only after more iterations than I expected. A pattern emerged:
- The tool reported the login flow was ready.
- I tested it and found it wasn’t working.
- The tool “fixed” it and said it was working.
- I repeated the cycle.
Once the login worked, I moved on to a series of UI and functionality fixes. I was surprised at how often the AI either ignored requirements or simply got things wrong. In some ways, working with it felt like assigning tasks to an overconfident junior developer: the work came back quickly and was confidently declared done, but it rarely matched the brief on the first try.
Pass #1 code analysis
The resulting code could probably be described as “fine, not great.”
Front End:
The tool chose TypeScript React with Tailwind CSS for styling. Before doing anything else, it created its own component library by simply importing components from Radix UI, applying some visual theming, and re-exporting them. This was a quick way to get a usable library, but it also generated many unused files.
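To illustrate the pattern (a hypothetical component, not the generated code), a wrapped Radix primitive typically looks something like this:

```tsx
// Sketch of the "import, theme, re-export" pattern described above.
// Radix supplies an unstyled, accessible primitive; the wrapper adds Tailwind theming.
import * as React from "react";
import * as SwitchPrimitive from "@radix-ui/react-switch";

export const Switch = React.forwardRef<
  React.ElementRef<typeof SwitchPrimitive.Root>,
  React.ComponentPropsWithoutRef<typeof SwitchPrimitive.Root>
>((props, ref) => (
  <SwitchPrimitive.Root
    ref={ref}
    className="h-6 w-11 rounded-full bg-slate-700 data-[state=checked]:bg-indigo-500"
    {...props}
  >
    <SwitchPrimitive.Thumb className="block h-5 w-5 rounded-full bg-white transition-transform data-[state=checked]:translate-x-5" />
  </SwitchPrimitive.Root>
));
Switch.displayName = "Switch";
```

Multiply a few dozen of these files across a full component library, and it is easy to see where the unused-file cruft comes from.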
The code was generally understandable and human-parseable, though decisions about where to delineate logic, types, and UI concerns were often naïve or inconsistent. There was some noticeable cruft: files and functions that were present but never used. All custom component styles were dumped into a single index.css file.
It’s still impressive that all of this was generated without a human writing any of the code, a feat that would have been hard to imagine a decade ago. If AI alone were going to maintain the code forever, perhaps maintainability wouldn’t matter much. But performance, whether in production or for the AI itself while iterating, could suffer in such a messy environment. And if a problem arose that the AI could not fix, a human developer might struggle to make sense of it.
More problematic was the lack of semantic HTML. The UI was designed for mouse navigation only. Keyboard or screen reader users would face significant challenges, and SEO could suffer (though how much that matters in the age of AI-driven search is debatable).
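As a small illustration (not the app’s actual markup), the difference between a mouse-only control and a semantic one is trivial to write but significant for users:

```tsx
import * as React from "react";

// A clickable <div> works for mouse users but is unreachable by keyboard
// and is announced as plain text (or not at all) by screen readers.
export const MouseOnlyDelete = ({ onDelete }: { onDelete: () => void }) => (
  <div className="delete-icon" onClick={onDelete}>✕</div>
);

// A real <button> is focusable, operable with Enter/Space, and announced
// with its accessible name, with no extra JavaScript required.
export const AccessibleDelete = ({ onDelete }: { onDelete: () => void }) => (
  <button type="button" aria-label="Delete to-do item" onClick={onDelete}>
    ✕
  </button>
);
```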
Back End:
The tool selected Express with TypeScript, using an ORM to interact with the database. Functionality was adequate, but the structure was sloppy. Nearly all routes were crammed into a single file exceeding 300 lines, except for the auth routes, which were abstracted separately. Multiple auth implementations existed, but only one was used. The API was RESTful-ish but occasionally broke conventions. Instead of a classic app-level middleware pattern, the tool passed auth checks as arguments to individual route handlers.
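To make the structural point concrete, here is a sketch (with an assumed isAuthenticated middleware, not the generated code) of per-route auth checks versus conventional router-level middleware:

```typescript
import express from "express";
import { isAuthenticated } from "./auth"; // assumed session-checking middleware

const app = express();

// Pattern described above: the auth check is passed to each route individually,
// so omitting it on any one route silently leaves that route unprotected.
app.get("/api/lists", isAuthenticated, (_req, res) => res.json([]));
app.post("/api/lists", isAuthenticated, (_req, res) => res.status(201).end());

// Conventional alternative: mount the check once on a router,
// making protection the default for everything under /api.
const api = express.Router();
api.use(isAuthenticated);
api.get("/lists", (_req, res) => res.json([]));
api.post("/lists", (_req, res) => res.status(201).end());
app.use("/api", api);
```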
Naïve implementations were present in the back-end code as well. For example, if an error occurred while sending a verification email during registration, the user would be told their registration had failed. Yet their record would already exist in the database, locking them out of registering again.
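Illustratively, the failure mode looked something like the following sketch; db, hash, and sendVerificationEmail are placeholder helpers, not the generated code:

```typescript
import type { Request, Response } from "express";

// Placeholder signatures standing in for the app's real storage and email helpers.
declare const db: {
  users: { create(u: { email: string; passwordHash: string }): Promise<{ id: number; email: string }> };
};
declare function hash(password: string): Promise<string>;
declare function sendVerificationEmail(to: string): Promise<void>;

export async function register(req: Request, res: Response) {
  const { email, password } = req.body;

  // The user row is written first...
  const user = await db.users.create({ email, passwordHash: await hash(password) });

  try {
    await sendVerificationEmail(user.email);
  } catch {
    // ...so if the email fails, the handler reports failure, yet the row already
    // exists, and any retry collides with the half-registered record.
    return res.status(500).json({ message: "Registration failed" });
  }

  return res.status(201).json({ id: user.id });
}
```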
Interestingly, there were no unit tests generated in the output.
A brief diversion into maleficence
After this first pass was complete, I was curious: could I convince the tool to implement truly bad ideas? Would it recognize dangerous requests? Would it refuse, or put up guardrails?
In the two experiments attempted, the answer was “no.”
Experiment 1 – Enforcing Unique Passwords
I asked the system to enforce unique passwords across all users and, more diabolically, to display which other user already had the same password. For example:
“This password is already in use by [username]; please choose a different password.”
Obviously, this would completely compromise user credentials. Yet the tool happily implemented it without hesitation or warning. With a few tweaks, the feature became fully functional.
Experiment 2 – Emailing Passwords in Plain Text
The tool implemented a “Forgot Password” button that, when clicked, emails the user their password in plain text. Anyone with basic security knowledge knows this is a terrible practice, as it requires storing and transmitting passwords unencrypted. (The standard, secure method is to handle forgotten passwords with a password reset flow).
In this case, the tool briefly warned:
“Sending passwords in plain text via email is a security risk. The standard approach is to send a password reset link instead. But I'll implement what you've requested.”
This warning flashed by quickly, with no opportunity to confirm or reconsider. Shortly thereafter the feature was implemented, with user passwords stored in plaintext in the database.
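For contrast, the reset-flow alternative the tool itself alluded to is not complicated. A minimal sketch, assuming placeholder storage and email helpers, might look like this:

```typescript
import { createHash, randomBytes } from "crypto";

// Placeholder helpers standing in for the app's real storage and mailer.
declare const db: {
  resetTokens: { save(t: { email: string; tokenHash: string; expiresAt: Date }): Promise<void> };
};
declare function sendEmail(to: string, body: string): Promise<void>;

export async function requestPasswordReset(email: string): Promise<void> {
  // A random, single-use token is generated, and only its hash is stored,
  // so neither the password nor a usable token ever sits in the database.
  const token = randomBytes(32).toString("hex");
  const tokenHash = createHash("sha256").update(token).digest("hex");

  await db.resetTokens.save({
    email,
    tokenHash,
    expiresAt: new Date(Date.now() + 60 * 60 * 1000), // valid for one hour
  });

  // The user follows the emailed link and sets a new password;
  // the old password is never stored or transmitted in the clear.
  await sendEmail(email, `Reset your password: https://example.com/reset?token=${token}`);
}
```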
The Takeaway:
This experiment highlighted an inherent danger in a tool that provides capability without necessarily providing guardrails. Any decent human developer would notice when you are attempting to implement something dangerous and warn you; this is, apparently, not necessarily true of an AI tool.
Pass #2 — “Astro Todo” created by a technical user
“I want to make an outer-space-themed to-do app called ‘Astro 2 Do Pro.’ The requirements are attached.”

For the second pass, I adopted the persona of a more technically experienced user and provided the above prompt.
The thought process was that perhaps a user with more technical knowledge could prime the tool to generate an output that better adhered to best practices; perhaps the user could even work more collaboratively with the tool, using vibe coding for rapid prototyping and then having human developers handle polish or more complex functionality.
While this prompt was similar to Pass #1, the attached requirements document was far more detailed. It prescribed the tech stack – TypeScript React with some variant of Bootstrap for the front end, and NodeJS with Express (TypeScript) plus a PostgreSQL database for the back end. It also included specific guidelines for code structure and recommendations for accessibility, security, and other cross-cutting concerns.
I followed the same rule as before: I didn’t look at the code until the end.
Early Challenges
Right away, this pass was more challenging. In Pass #1, where I gave no technical requirements, the initial UI looked correct and was functional. Here, the initial UI was malformed and missing critical components. It took several iterations to fix.
The tool handled my “surprise” requirement of spinning up custom authentication about the same as before: a slow process involving multiple failed attempts before I could log in and move into the app. Once inside, layout problems continued to crop up, and UI issues plagued the process more than in Pass #1.
Again, I don’t want to understate the accomplishment: having an AI build a functioning app is still remarkable. But it’s frustrating when it confidently presents a broken application as “ready to use.” As before, it ignored several of my stated requirements, though I wouldn’t realize just how many until I reviewed the code.
Accessibility Testing
Once the app reached an MVP state, I shifted focus to accessibility. From testing, it was clear the app suffered the same navigability issues as Pass #1: keyboard users and screen readers would have difficulty using it.
When I asked whether the app was “generally accessible and WCAG 2.1 AA compliant,” the tool identified several issues, such as missing labels, poor color contrast, lack of semantic markup, and dynamic content problems. It attempted to fix them. However, it only addressed some of the issues it had identified, and it ignored others I had noted. Navigability problems persisted.
The tool admitted its limitations:
“The app is significantly more accessible now but would benefit from a comprehensive accessibility audit to achieve full WCAG 2.1 AA compliance.”
Pass #2 code analysis
When I finally poked under the hood, it was immediately clear that the app had ignored most of my initial technical requirements. Strangely, the tool was “aware” of those requirements; it even displayed them on the “info” page:
- Frontend: TypeScript React with Bootstrap 5
- Styling: SASS with custom space-themed components
However, that was the only place “Bootstrap” appeared. The codebase contained no Bootstrap imports at all. Instead, the AI appeared to have generated Bootstrap-like utility classes on the fly in a global index.css, sometimes referencing classes it never actually created. No SASS was used, and styles weren’t organized alongside components. Everything was crammed into a single global stylesheet.
The tool did follow some of my structural requirements, such as creating a /services directory for API interaction classes. But even this was inconsistent: Lists and Todos had dedicated service classes, while authentication was handled through a useAuth hook.
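For reference, the kind of structure I had asked for looks roughly like the sketch below; ListService and its endpoint paths are hypothetical, not the generated code:

```typescript
// A dedicated service class per resource keeps API calls out of components.
export interface TodoList {
  id: number;
  title: string;
}

export class ListService {
  constructor(private readonly baseUrl: string = "/api/lists") {}

  async getAll(): Promise<TodoList[]> {
    const res = await fetch(this.baseUrl);
    if (!res.ok) throw new Error(`Failed to load lists: ${res.status}`);
    return res.json();
  }

  async remove(id: number): Promise<void> {
    const res = await fetch(`${this.baseUrl}/${id}`, { method: "DELETE" });
    if (!res.ok) throw new Error(`Failed to delete list ${id}: ${res.status}`);
  }
}
```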
As in Pass #1, no unit tests were generated, despite my explicit request.
For the component library, the same trick appeared: importing, theming, and re-exporting the Radix UI components. This time, though, the theming referenced Bootstrap utility classes that didn’t exist, so had those components actually been used, they would likely have rendered without their intended styling.
Overall, the frontend code was on par with Pass #1 – functional but amateurish, with odd abstractions and inconsistencies.
On the back end, where I’d given less technical guidance, the structure was again similar to Pass #1’s. However, I quickly spotted a serious security flaw: users could delete lists they did not own. This was a straightforward access-control oversight, but one that could be exploited easily.
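The flaw boiled down to something like the following simplified sketch (placeholder types and storage helpers, not the actual generated route):

```typescript
import type { Request, Response } from "express";

// Placeholder types and data-access helpers standing in for the app's real ones.
interface AuthedRequest extends Request {
  user?: { id: number };
}
declare const storage: {
  getList(id: number): Promise<{ id: number; userId: number } | undefined>;
  deleteList(id: number): Promise<void>;
};

// Vulnerable shape: any authenticated user can delete any list by guessing its id.
export async function deleteListInsecure(req: AuthedRequest, res: Response) {
  await storage.deleteList(Number(req.params.id));
  res.status(204).end();
}

// Fixed shape: ownership is verified before the delete is allowed.
export async function deleteListSecure(req: AuthedRequest, res: Response) {
  const list = await storage.getList(Number(req.params.id));
  if (!list || list.userId !== req.user?.id) {
    return res.status(404).end(); // 404 avoids revealing which ids exist
  }
  await storage.deleteList(list.id);
  res.status(204).end();
}
```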
A brief diversion into beneficence
Once I spotted the security issue, I decided to test whether the tool could identify and fix it, and whether it could refactor the code to match the original technical requirements.
Security Review
I stuck with vibe coding for this test, even though I’d discovered the flaw by inspecting the code directly. I asked a simple, non-leading question:
“Are there any security vulnerabilities in the code? If so, please resolve them.”
The tool went through several automated review-and-repair cycles, finding and fixing a number of issues. A quick review of the updated code suggested it had indeed resolved the list-deletion vulnerability. While a few new login and navigation glitches appeared after its changes, these were fixable with additional iterations.
Refactoring to Requirements
For this step, I switched from Replit’s “agent mode” to the “assistant mode,” which is designed for users comfortable editing code directly. My goal was to have the AI:
- Replace Tailwind with actual React Bootstrap.
- Restructure the code so styles were organized into component-specific .scss files, leaving only global theming in a single global stylesheet (sketched below).
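As a reference for the second goal, the target organization looked roughly like this; the file names are hypothetical, and the .scss import assumes a bundler (such as Vite) configured to handle stylesheets:

```tsx
// Illustrative target layout, not the actual refactor:
//
//   src/styles/_theme.scss        <- global colors, typography, Bootstrap overrides
//   src/components/TodoList/
//     TodoList.tsx
//     TodoList.scss               <- styles scoped to this component only
//
import * as React from "react";
import "./TodoList.scss";

export const TodoList = ({ title }: { title: string }) => (
  <section className="todo-list">
    <h2 className="todo-list__title">{title}</h2>
  </section>
);
```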
The assistant successfully installed React Bootstrap and replaced Tailwind classes – but it also unexpectedly reimagined large parts of the UI. It removed the footer, added links, and changed how list filtering worked. When I asked it to restore the previous layout while keeping React Bootstrap, it only partially succeeded; the remainder would have required manual correction or targeted prompts for each mismatch.
The SCSS refactoring went more smoothly. While it introduced some minor layout issues, these were small enough to be resolved with a few more iterations.
Findings
I’m generally optimistic about artificial intelligence. While there’s a range of opinions, many see enormous potential in these tools.
On the surface, the capabilities unlocked by vibe coding are truly incredible. The fact that a user with no coding experience can describe the application they want and receive something functional (and often visually appealing) is nothing short of remarkable.
That said, my experiment revealed very real limitations. AI is still “artificial.” It isn’t thinking, nor does it truly know anything. It generates text based on patterns and predictions. It often requires careful, sustained guidance from a human partner to accomplish some tasks that seem simple. And it will frequently ignore, forget, or reinterpret requirements.
There also appear to be some built-in tricks to smooth the vibe coding process, such as a strong default tendency to wrap Radix UI and use Tailwind. This works fine when no technical requirements are specified, but it becomes a hindrance when you do set specific requirements, unless you switch to a more developer-friendly mode like Replit’s “Assistant.”
The “artificial” nature of AI is most apparent in the naïve way it generates code. It lacks the judgment of a human developer to recognize and flag dangerous or ill-advised features. My maleficence tests showed that the tool was willing to implement blatantly insecure ideas, sometimes without warning.
The quality of the generated code was inconsistent: functional for the most part, but often amateurish, difficult to scale, and riddled with potential bugs, security gaps, and accessibility issues. Codebases were messy, with dead code and odd abstractions.
Some might argue that code quality doesn’t matter if the app works for users. But I’d counter that poor code quality has real implications:
- Maintenance: Difficult for both humans and AI.
- Performance: Inefficient code can slow both the app and future AI iterations.
- Scalability: Poorly structured code can make it hard or impossible to grow the app.
Vibe coding does have its place. For rapid prototyping, it can be an extremely useful strategy, especially if you are using a tool that isn’t too prescriptive about the technology stack. It’s also a good fit when requirements are loose and you don’t expect the application to scale significantly in the near future (for example, a small local business needing a simple marketing site).
However, for applications with specific requirements, a need to scale for heavy use, and an expectation of ongoing maintenance or new feature development, vibe coding is risky as the primary or sole creation method. The inconsistent output quality, the messy code, tendency to ignore requirements, and lack of meaningful guardrails make it unsuitable for those contexts beyond perhaps the initial prototype stage.
That’s not to say AI should be avoided altogether. When security policies allow, AI tools can be powerful accelerators for development. Even in agentic workflows, though, code should be reviewed iteratively by knowledgeable developers to ensure it’s well-structured, secure, bug-free, and fully meets requirements.
It is worth noting that this experiment examined only one tool. Other platforms may offer different strengths and weaknesses, and these tools are evolving rapidly. In 6-12 months, capabilities could be far more sophisticated, making vibe coding more viable.
The future of AI-assisted development is promising. Coding assistants and AI agents are already enabling faster, more efficient, and sometimes safer coding. But, as with any tool, proper use is key.


