What VCs Don’t Understand About Vibe Coding

If you're not using AI tools to code in 2025, you're just behind the curve. But vibe-coding on an existing codebase is like handing the keys to your car to a drunk teenager. It may be fine, but it's more likely to end in tears.

AI tools such as Claude Code, GitHub Copilot, and Cursor's autocomplete offer substantial productivity gains. I estimate I code 10-20% faster on average with these tools enabled. On a recent internet-less flight, I became acutely aware of how much I have come to rely on Cursor's autocomplete as I sat there waiting for auto-generated unit tests that were never going to come.

Nevertheless, it's important to know the outer bounds of these tools. We at Reffie are a Python + FastAPI shop on the backend, so this post will concentrate on that language, but most of it should generalize. Over the past 8 months, I have used GitHub Copilot, Windsurf IDE, Cursor IDE, and now Claude Code continuously. This post applies to all these tools, and likely the one you're using as well.

A Primer on Vibe Coding

If you're not familiar, coding agents have (roughly speaking) three modes:

  1. Autocomplete / copilot. Much like Superhuman offering to autocomplete a sentence in an email, Copilot will autocomplete the current line, the current simple function, or the current unit test while you're drafting.
  2. Agentic editing with human approval. You chat with a built-in chat prompt (imagine embedded ChatGPT), and it offers suggestions and improvements to your code. You manually review and approve all changes generated. Think of this like Tesla autopilot - the car is driving itself, but you're holding the wheel and are ready to intervene in case you get in trouble.
  3. Vibe-coding. The same as above, except no approval is necessary. The AI is giddily galloping through and live-editing a range of files and in some cases also running commands on your machine without human restraint. You're handing AI the wheel.

Asking an AI agent to take the wheel is like asking a particularly disengaged summer undergrad intern (likely sleep-deprived from the night before) to merge straight to production without review. Sure, you can. And the code they'll write will probably work on their machine. But, much like repeatedly smashing the espresso button to make a full cup of coffee, should you?

In practice, vibe-coding is great for rapid prototyping, early mock-ups, proofs of concept, or quick one-off scripts (caveats below).

But vibing on a production codebase is actually insane. It's tapdancing on a tightrope with no safety net.

Vibe Coding on an Existing Codebase

So many "reviews" of vibe-coding center on new projects or hobby coding. But oddly absent from the discussion is vibe-coding for professionals: engineers who work every day with a product that has tens of thousands, or maybe hundreds of thousands, of lines of code already written. And that tradecraft looks very different.

I won't harp on about vibe-coded apps that leaked all their users' data. I will instead focus on what I think are the big differences between greenfield and extant codebases, and why AI tools do a particularly poor job on the latter.

Where do we spend time as developers?

When working with an existing codebase, according to my team's internal metrics (we use Linear for ticket tracking and time estimation) we spend 3x as much time on fixing bugs, improving performance, and refining the architecture compared to actually writing v1 of the feature. Consequently, the most important properties of our codebase are:

  1. The code is modern and idiomatic for the language & framework we're using. Each library and language has a recommended way to perform common operations, but you can do things the wrong way if you like. Language conventions can also change with version updates. Updating library versions is also its own kind of hell, introducing some features and deprecating others. We want to avoid writing code that is already deprecated by the time we write it, because someone will have to clean that up later (often that someone is me).
  2. The code is easy to read rather than fast to write. Internal consistency is important.
  3. The code is easy to debug. Code that is hard to debug is a net time suck.
  4. The code is reused: write it once, then use it everywhere. Chances are, someone already wrote it, and wrote it better. Conversely, your new code almost certainly has bugs.

AI agents fail catastrophically at all four points.

Common anti-patterns in Python code

AI agents tend to write Python code that features a number of anti-patterns. I think this is because a lot of code available online is either starter code, or code written by people just learning how to program. Since the AI was trained on bad code, it generates bad code in turn. I have heard anecdotally that code quality may be higher or lower depending on language. Below, I will highlight some bad code styles I have seen Claude generate in the past month:

Catching bare exceptions

try:
    do_stuff()
except Exception as e:
    ...

This is considered an anti-pattern because you're catching anything that can go wrong, including errors that make more sense to surface than to swallow: programmer errors (in Python, you can catch some types of syntax errors like this), system errors (like a network connectivity issue), or totally unexpected errors that you never meant to catch at all. This pattern masks errors and hinders debuggability.
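The narrower fix is to catch only the exception you expect and let everything else propagate. A minimal sketch (`parse_port` is a hypothetical helper, not from our codebase):

```python
import logging

logger = logging.getLogger(__name__)

def parse_port(raw: str) -> int:
    """Parse a port number from config, falling back only on bad input."""
    try:
        return int(raw)
    except ValueError:
        # Catch only the failure we expect; anything else propagates.
        logger.warning("Invalid port %r, falling back to 8080", raw)
        return 8080
```

A `KeyboardInterrupt` or a genuine bug inside the `try` block now surfaces immediately instead of being silently swallowed.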

f-string formatted logging statements with computation

logging.debug(f"This is the result of some computation: {my_func(thing)}")

In Python, f-strings are evaluated eagerly, so the string above is always built in full, even when the log level means the message will never be emitted. The logging module's %-style formatting, by contrast, is deferred: the final string is only assembled if the record is actually handled. So, to contrast:

logging.debug("Result: %s", expensive_object)

Here, str(expensive_object) is only computed when the debug level is enabled. This sometimes comes up when you're trying to log deep diagnostics about why something went wrong.
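One subtlety: %-style arguments themselves are still evaluated eagerly; only the string formatting is deferred. If the computation (not the formatting) is the expensive part, guard it on the log level. A minimal sketch, where `heavy_compute` is a stand-in for an expensive diagnostic:

```python
import logging

logger = logging.getLogger("demo")
logger.setLevel(logging.INFO)  # DEBUG records will be dropped

calls = 0

def heavy_compute() -> str:
    # Stand-in for an expensive diagnostic computation (illustrative).
    global calls
    calls += 1
    return "diagnostics"

# Guard genuinely expensive work so it only runs when DEBUG is enabled:
if logger.isEnabledFor(logging.DEBUG):
    logger.debug("Result: %s", heavy_compute())
```

With the level at INFO, the guarded call never executes, so `heavy_compute` is never invoked.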

SQLAlchemy old style code

SQLAlchemy is the premier ORM for Python. Version 2.0 was released in early 2023, meaning the vast majority of code online still targets the 1.x style. But that code style is now legacy. So, consistently, if you ask Claude to generate some database queries, it will generate code like this:

query = db_session.query(Table).filter(Table.id == 3).one()

But note that in 2.0, the entire Query API is considered legacy. The equivalent using 2.0 syntax looks like this:

# option 1
query = db_session.execute(
	select(Table).where(Table.id == 3)
).scalar_one()

# option 2 - may return None, but generally preferred for primary-key
# lookups: it also checks the session's identity map and skips the
# query entirely if the object is already loaded
query = db_session.get(Table, 3)

You can read more about Session.get on the docs page.

Python old style code

The most recent release of Python is 3.14, but Python 3.10 has been out since 2021. Among the features available by then are union types written with | (3.10) and built-in generic types like list[int] (3.9). Yet Claude still generates code like this:

from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class Foo:
    item: Optional[str] = None

def to_json() -> Dict[str, List[int]]:
    ...

The modern way (since 2021) to write that is:

from dataclasses import dataclass

@dataclass
class Foo:
    item: str | None = None

def to_json() -> dict[str, list[int]]:
    ...

These are just a few examples that I've observed on code review, but I'm sure others can point out more. The point is that the models don't really have context on what is deprecated and what is modern, and will just use whatever.

Refinements & Long Functions

Once Claude generates code, refinements go like this:

  • "Looks like the code doesn't handle this case. Fix it"
  • "OK I've added that"
  • "OK but what about this?"
  • "Now I've added that too"

After a few back-and-forths, the code features gigantic, deeply nested Optimus Prime functions - they do everything and nothing. Impossible to test, impossible to debug, and genuinely unreadable.

Existing Code, Conventions & Reuse

Conventions go beyond just linting and testing. They're a flow that your team has. When your team flows together, you can move much faster.

Conventions are important when you have a team rather than just a single developer. They speed up development time because each developer knows what a function should be called. They know the argument order. They can guess at the keyword argument names. They can guess at where a helpful class or utility will be located in the file tree.

Experienced developers will try to write code in the style of the current project, while new developers can be taught. The problem with Claude, like a bad intern, is it doesn't know and doesn't want to learn.

It will constantly write code that looks completely foreign to the codebase. It will switch argument orders. It will re-implement functions where helpers exist in the codebase (often badly).

Does CLAUDE.md help?

No! Claude will make the same mistakes over and over, and codifying them into CLAUDE.md only further infuriates our developers: they ask Claude to read it, Claude acts sorry, and then it commits the same crimes in the next prompt.

The Anthropic team has some great advice here that can be summarized as "if the AI makes a mistake, throw it all out and start again instead of asking it to fix". Some developers have anecdotally gotten better results when throwing out their CLAUDE.md files entirely. YMMV. ¯\_(ツ)_/¯

What's the takeaway? Should I stop using AI?

No! First, it’s clear based on the last 8 months that AI-generated code is rapidly improving. In January of 2025, the code written by the Copilot agent was hot garbage and in March the code generated by Windsurf was barely usable. Check back in another 8 months and it’s likely Anthropic and friends will have found a way to address some of these issues (God I hope so).

In the meantime, you have to read every line of code the AI generates. Treat the AI like an unreliable undergrad summer student. It’s not invested. It doesn’t care if it brings down production or it deletes some files. It goes back to school in September either way and will never see you again.

That's why my team is in-person. If you bring down production, you should at least have to endure some awkward looks. 👀 IMO, if you can't shame an AI then you can't trust it.

About Us

If you love using AI tools but care about building things beautifully the first time, we’re hiring. DM me or email careers@reffie.me. Please note that all our roles are in-person in Toronto (because if I can’t stare deeply into your soul after you submit a 3k line PR then what’s even the point of being a startup founder).