Beyond vibe coding: a multi-agent pipeline for code conversion
Intro
A few weeks ago I ran a very simple test: I took TinyDB, opened a coding assistant, and asked it the most obvious thing in the world: convert this to Rust.
At first glance, the result did not even look that bad. The code looked plausible, it compiled, it passed a few tests, and it gave off that dangerous feeling of “I think we’re basically there.” But as soon as I started looking closely, the real problems surfaced: a deadlock in insert_multiple, silent integer-to-float corruption in update operations, a panic path on invalid regex, and roughly 40% of the original package features simply gone.
The problem is not that LLMs “cannot convert code.” It is that converting a non-trivial codebase is not a single generative task. It is not “write Rust instead of Python.” It is “reconstruct structure, dependencies, design intent, migration order, invariants, tests, and only then generate code.”
In other words: I am not really criticizing the model itself. I am criticizing the shape of the work we ask it to do.
That is why I tried a different approach: instead of delegating everything to one pass, I first enriched the codebase with structured knowledge and then let a small team of specialized agents use that base to plan and execute the conversion.
The thesis of this article is simple: for non-trivial code conversions, the big jump in quality does not come from a smarter prompt. It comes from decomposing the problem and using deterministic tools wherever possible.
In my previous article, I had already talked about Spec-Driven Development. Here I want to apply that same reasoning to a very concrete case: porting a real library from Python to Rust.
Why single-shot conversion breaks
Over the last few months, coding agents have improved a lot. On small codebases or tightly scoped tasks, the “vibe coding” approach works better than many people expected.
But when the target is a real library with multiple modules, a public API, implicit behavior, and non-trivial architectural constraints, the cracks show up fast.
In my tests, the most recurring failure modes were these:
- Hallucinated dependencies: the model invents relationships that do not exist or loses the real import and call topology.
- Confused planning: helpers, tests, core modules, and secondary details all end up on the same level.
- Cross-file inconsistency: a type defined in one file no longer matches the abstraction used in another.
- Token waste: the model spends context rediscovering structural facts that a parser could extract deterministically.
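To make the last point concrete: a structural fact like a module's import list is a few lines of deterministic Python with the standard `ast` module. This is an illustrative sketch, not the project's parser:

```python
import ast

SOURCE = """
import json
from collections import defaultdict
from tinydb.storages import JSONStorage
"""

def extract_imports(source: str) -> list[str]:
    """Walk the AST and return every imported module name."""
    imports = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            imports.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            imports.append(node.module or "")
    return imports

print(extract_imports(SOURCE))  # → ['json', 'collections', 'tinydb.storages']
```

No tokens spent, no hallucination possible: the import topology is simply read off the syntax tree.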
Code conversion is a composite task. Some parts are deterministic, some are architectural, some are exploratory, and only part of it is truly generative.
Treating all of this as one monolithic prompt means asking the same model to:
- understand the structure of the codebase,
- infer the original design,
- estimate dependencies,
- decide the migration order,
- generate the code,
- verify on its own whether what it wrote actually makes sense.
That is too much. Not because the model is “bad,” but because we are compressing a problem made of very different phases into a single pass.
The approach: enrich the codebase before converting
The solution I tried does not start from the prompt. It starts from the codebase.
Before asking an LLM to generate code in the target language, I build two knowledge layers:
- a graph database in Neo4j that stores the structural skeleton of the source codebase: modules, classes, functions, methods, variables, imports, calls, inheritance, and containment;
- a vector store in LanceDB that contains documentation and examples for the library, used in RAG mode.
Then I make those tools available to a coordinated team of agents:
flowchart TB
subgraph "Knowledge Layer"
A[Python codebase] -->|AST parsing| B[Neo4j graph]
D[PDF or Markdown docs] -->|Chunk + embed| E[LanceDB vector store]
end
subgraph "Agent Layer"
O[Orchestrator] --> CA[Code Analyzer]
O --> BA[Builder]
O --> VA[Verifier]
CA -->|Cypher queries| B
BA -->|Semantic search| E
BA -->|Code generation| F[Generated Rust code]
VA -->|Compile + review| F
end
O -->|Conversion plan| BA
VA -->|Revision feedback| BA
The advantage is not having “more agents” by itself. The advantage is giving each stage a narrower responsibility and a better toolset.
I am not giving the model maximum freedom. I am deliberately reducing its decision space in the places where that freedom is most likely to produce expensive mistakes.
| Agent | Role | Main tools | Why this choice works |
|---|---|---|---|
| Orchestrator | Coordinates the workflow and delegates tasks | File tools, task orchestration | Strong model for planning and decomposition |
| Code Analyzer | Produces a conversion plan | Neo4j tools, Cypher queries, complexity analysis | Smaller model is enough because it reads structured data |
| Builder | Generates Rust modules | Vector search, code generation | This is where a stronger coding model really pays off |
| Verifier | Checks correctness and proposes fixes | Compiler checks, semantic review | The same coding model, but focused on review |
The key design choice is this: not every role deserves the same model.
The Analyzer does not need a frontier model if it is querying a graph that has already been built. The Builder is exactly where it makes sense to spend more, because that is the genuinely generative step.
The four steps of the pipeline
1. Build the graph
The first step is a static pass over the AST of every .py file in the repository. No LLM is involved here.
It is deterministic, cheap, and fast. Of course, Python static analysis has obvious limits when behavior gets highly dynamic, but for a codebase like TinyDB it captures most of the topology that the planning phase actually needs.
# kb_builder/python_graph_parser.py (simplified)
class PythonGraphParser:
    def parse_repository(self):
        python_files = list(self.repo_path.rglob("*.py"))
        for py_file in python_files:
            self._parse_file(py_file)
        return self.nodes, self.relationships

class ModuleVisitor(ast.NodeVisitor):
    def visit_ClassDef(self, node):
        class_id = f"class:{self.module_name}.{node.name}"
        bases = [self._get_name(base) for base in node.bases]
        self.parser._add_node(Node(
            id=class_id,
            type="Class",
            name=node.name,
            properties={
                "full_name": f"{self.module_name}.{node.name}",
                "bases": bases,
                "docstring": ast.get_docstring(node),
            },
        ))
        for base in node.bases:
            base_id = self.resolve_name(self._get_name(base), "class")
            self.parser._add_relationship(Relationship(
                source_id=class_id,
                target_id=base_id,
                type="INHERITS",
            ))

The parser extracts seven node types: Module, Class, Function, Method, Variable, GlobalVariable, and ClassAttribute.
It also captures structural relations such as CONTAINS, IMPORTS, INHERITS, CALLS, DEFINES, USES, and DECORATES.
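As an illustration of how one of these relations can be collected, here is a minimal, hypothetical visitor for CALLS edges. It only resolves plain-name calls; the real parser also has to handle attribute access, aliases, and cross-module references:

```python
import ast

class CallCollector(ast.NodeVisitor):
    """Collect (caller, callee) pairs for plain-name calls.

    Deliberately simplified: attribute calls like obj.method()
    and imported aliases are ignored here.
    """

    def __init__(self):
        self.calls = []
        self._scope = "<module>"

    def visit_FunctionDef(self, node):
        outer, self._scope = self._scope, node.name
        self.generic_visit(node)  # descend into the function body
        self._scope = outer

    def visit_Call(self, node):
        if isinstance(node.func, ast.Name):
            self.calls.append((self._scope, node.func.id))
        self.generic_visit(node)

source = """
def insert(doc):
    validate(doc)
    return write(doc)
"""

collector = CallCollector()
collector.visit(ast.parse(source))
print(collector.calls)  # → [('insert', 'validate'), ('insert', 'write')]
```

Each collected pair becomes one CALLS relationship in the graph.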
Once parsing is finished, the graph is bulk-loaded into Neo4j:
uv run python kb_builder/build_python_graph.py \
--repo-path /path/to/target/python/repo

2. Verify and visualize the graph
Once the graph is in Neo4j, I can inspect it in two ways:
- with targeted queries, for precise checks;
- with graph visualization, to get a high-level map of the codebase.
Here are two simple examples:
// Find the most complex classes
MATCH (c:Class)
OPTIONAL MATCH (c)-[:DEFINES]->(m:Method)
OPTIONAL MATCH (c)-[:DEFINES]->(a:ClassAttribute)
WITH c, count(DISTINCT m) AS methods, count(DISTINCT a) AS attributes
WHERE methods + attributes > 5
RETURN c.name, c.full_name, methods, attributes
ORDER BY (methods + attributes) DESC

// Trace call chains up to depth 3
MATCH path = (f1)-[:CALLS*1..3]->(f2)
WHERE (f1:Function OR f1:Method) AND (f2:Function OR f2:Method)
RETURN f1.full_name AS caller, f2.full_name AS callee, length(path) AS depth
LIMIT 25
Figure 1 - High-level view of the TinyDB graph in Neo4j

Figure 2 - Zoomed view of relationships inside the graph
This stage is useful even before the actual conversion starts. It shows which modules are central, which ones are leaves, and where the dependency bottlenecks sit. That makes planning a realistic build order much easier.
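To sketch what "planning a realistic build order" means in practice, here is a toy example using Python's standard `graphlib` on a hypothetical slice of the import topology. In the actual pipeline this plan comes from Cypher queries against Neo4j, not from a hard-coded dict:

```python
from graphlib import TopologicalSorter

# Hypothetical slice of the topology: module -> internal modules it imports.
imports = {
    "database":   {"table", "storages"},
    "table":      {"queries", "storages", "utils"},
    "queries":    {"utils"},
    "storages":   set(),
    "utils":      set(),
    "operations": set(),
}

# Leaves (no internal imports) can be dispatched to Builders in parallel.
leaves = sorted(m for m, deps in imports.items() if not deps)
print("parallel first wave:", leaves)  # → ['operations', 'storages', 'utils']

# A dependency-respecting build order: every module comes after its imports.
order = list(TopologicalSorter(imports).static_order())
print("build order:", order)
```

The same two questions, "which modules are leaves?" and "what order respects the dependencies?", drive the Orchestrator's conversion plan later in the article.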
3. Load documentation into the vector store
To generate sensible code, the Builder needs more than source code. It also needs documentation, usage examples, and API details.
This is where the vector store comes in. The flow is the usual RAG one: load PDF or Markdown docs, split them into chunks, embed them, and save everything in LanceDB.
# kb_builder/extract_text_from_pdf.py (key steps)
loader = PyPDFLoader(str(doc_path))
pages = loader.load()
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=int(os.environ["SPLIT_CHUNK_SIZE"]),
    chunk_overlap=int(os.environ["SPLIT_CHUNK_OVERLAP"]),
)
chunks = text_splitter.split_documents(pages)
vectorstore.add_documents(chunks)

For public libraries, Context7 is often even better, because it provides curated, up-to-date documentation without the overhead of building a local RAG pipeline.
That said, local RAG with LanceDB or another vector database still makes sense when you are working with private repositories, internal frameworks, or company documentation that is not available on public services.
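For intuition about what the vector store does at query time, here is a deliberately naive sketch of similarity ranking. It uses bag-of-words cosine similarity as a stand-in; the real pipeline uses learned embeddings and LanceDB, not word counts:

```python
from collections import Counter
from math import sqrt

def bow(text: str) -> Counter:
    """Bag-of-words token counts (a crude stand-in for an embedding)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

chunks = [
    "insert adds a single document and returns its doc_id",
    "storages define how data is serialized to disk or memory",
    "queries are built from a Query object and field conditions",
]
query = "how do I insert a document"

# Rank doc chunks by similarity to the query, most relevant first.
ranked = sorted(chunks, key=lambda c: cosine(bow(query), bow(c)), reverse=True)
print(ranked[0])  # → insert adds a single document and returns its doc_id
```

The Builder later performs exactly this kind of lookup, except against embedded documentation chunks, before generating each module.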
4. Run the conversion
With the graph DB and the documentation in place, I can launch the actual pipeline:
bash deepagents.sh \
--project tinydb \
--target-language rust \
--show-trace \
--show-token-usage \
--show-model-thinking \
--write-run-report

Under the hood, the run_conversion.py entrypoint wires together an Orchestrator and three subagents using LangChain DeepAgents:
# run_conversion.py (simplified)
subagents = [
    CompiledSubAgent(
        name="code_analyzer",
        description="Analyze the codebase and create a conversion plan.",
        runnable=analyzer.agent,
    ),
    CompiledSubAgent(
        name="builder_agent",
        description="Generate target-language code for a specific component.",
        runnable=builder.agent,
    ),
    CompiledSubAgent(
        name="verifier_agent",
        description="Verify generated code and suggest revisions.",
        runnable=verifier.agent,
    ),
]

agent = create_deep_agent(
    model=orchestrator_llm,
    system_prompt="You are a code conversion assistant...",
    subagents=subagents,
    backend=build_backend_factory(memory_root),
    name="orchestrator",
)

Each role is independently configurable through environment variables, so the pipeline can assign different models to different jobs:
# .env (example)
ORCHESTRATOR_LLM_PROVIDER=bedrock
ORCHESTRATOR_LLM_MODEL=global.anthropic.claude-opus-4-5-20251101-v1:0
ANALYZER_LLM_PROVIDER=bedrock
ANALYZER_LLM_MODEL=global.anthropic.claude-haiku-4-5-20251001-v1:0
BUILDER_LLM_PROVIDER=bedrock
BUILDER_LLM_MODEL=global.anthropic.claude-sonnet-4-5-20250929-v1:0
VERIFIER_LLM_PROVIDER=bedrock
VERIFIER_LLM_MODEL=global.anthropic.claude-sonnet-4-5-20250929-v1:0

A few implementation details
The Analyzer: planning from Neo4j
The Analyzer does not read raw Python files. It talks to an MCP server I built specifically to access Neo4j. The MCP tool exposes targeted operations such as describe_graph_structure, query_graph, get_code, get_class_hierarchy, analyze_complexity, and get_module_dependencies.
# agents/code_analyzer.py
class CodeAnalyzer(BaseAgent):
    def __init__(self, config, *, enable_reasoning=False):
        tools_project = Path(__file__).resolve().parent.parent / "code_converter_tools"
        self.mcp_client = MultiServerMCPClient({
            "neo4j_tools": {
                "transport": "stdio",
                "command": "uv",
                "args": ["--directory", str(tools_project), "run", "neo4j_tools.py"],
            }
        })

This changes the planning problem quite a bit. Instead of telling the model "read and understand this repository," I can let it ask much more focused questions, such as:
- which modules are leaves in the dependency graph?
- which classes have the highest structural complexity?
- what inherits from `Storage`?
- which methods call `insert` or `search`?
For the Orchestrator, this is a huge difference: it starts from a queryable base of structural facts rather than a pile of files.
I also added middleware to limit tool usage and summarize long threads before they become too expensive:
def _initialize_middleware(self):
    return [
        ToolCallLimitMiddleware(tool_name="query_graph", thread_limit=100, run_limit=30),
        ToolCallLimitMiddleware(tool_name="get_code", thread_limit=500, run_limit=300),
        SummarizationMiddleware(
            model=self.llm,
            trigger=[("tokens", 200000)],
            keep=("messages", 20),
        ),
    ]

The Builder: retrieve context and generate code
The Builder mainly does two things:
- retrieve relevant documentation;
- generate the target module with that context.
In simplified form, the core tool looks like this:
# agents/builder_agent.py (simplified)
@tool
def generate_code_snippet(input_json: str) -> str:
    input_data = json.loads(input_json)
    python_code = input_data.get("python_code", "")
    description = input_data.get("description", "")
    # Build the retrieval query from the task description and source code
    search_query = f"{description}\n{python_code}"
    docs = self.vectorstore.similarity_search(search_query, k=3)
    doc_context = "\n\n".join([doc.page_content for doc in docs])
    prompt = f"""Convert the following Python code to {self.config.target_language}.
Python Code: {python_code}
Description: {description}
Relevant Documentation: {doc_context}
Please provide the converted code with explanatory comments."""
    response = self.llm.invoke(prompt)
    return response.content

The prompt matters less than the context here. The Builder reaches this step with the problem already narrowed down: it knows what it has to convert, which dependencies matter, and which documentation context it should use.
The Verifier: compiler checks and semantic review
The Verifier handles two main things:
- compiler checks: compile the generated code and capture syntax and type errors;
- semantic review: check that logic and API remain consistent with the plan produced by the Analyzer.
# agents/verifier_agent.py (simplified)
@tool
def check_syntax(code: str) -> str:
    # mode="w" so the code is written as text, not bytes
    with tempfile.NamedTemporaryFile(mode="w", suffix=".rs", delete=False) as tmp:
        tmp.write(code)
        tmp_path = tmp.name
    result = subprocess.run(
        ["rustc", "--crate-type=lib", "--error-format=json", tmp_path],
        capture_output=True,
        text=True,
        timeout=30,
    )
    return "Syntax OK" if result.returncode == 0 else result.stderr

It is worth stating this plainly: the generator’s self-confidence is not a verification strategy. A migration pipeline needs compiler checks, semantic review, and, if possible, executable tests.
Case study: converting TinyDB
Why TinyDB
I was looking for a Python project to migrate to Rust that was large enough to expose the limits of a single-shot approach, but not so large that the experiment became unmanageable.
TinyDB seemed like a good choice: although the source consists of about a dozen files, its dependency graph is rich enough to exercise the multi-agent pipeline, with 352 nodes and 413 relationships. That richness matters, because it is by no means certain that a codebase should be translated one file at a time.
In short, then, I chose TinyDB because it is:
- non-trivial: it has multiple modules, a query DSL, storage abstractions, and cross-module dependencies;
- manageable: it is small enough to stay within reasonable time and cost;
- well-documented: public documentation exists and is easy to ingest;
- well-tested: the original test suite gives a useful behavioral reference.
In short: small enough to finish the experiment, complex enough not to collapse into a toy example.
The experiment
The multi-agent run took about 18 minutes and consumed roughly 2.95 million tokens.
| Role | Model | Calls | Input Tokens | Output Tokens | Total |
|---|---|---|---|---|---|
| Orchestrator | Claude Opus 4.5 | 40 | 2,362,138 | 61,820 | 2,423,958 |
| Code Analyzer | Claude Haiku 4.5 | 9 | 200,240 | 16,646 | 216,886 |
| Builder | Claude Sonnet 4.5 | 39 | 230,783 | 81,763 | 312,546 |
| Verifier | Claude Sonnet 4.5 | 1 | 1,545 | 429 | 1,974 |
One thing is immediately clear: the biggest cost is not in the Builder, but in the Orchestrator. The pipeline adds structure, but it also adds non-trivial coordination overhead.
The execution trace captured 248 entries and 112 tool calls. The overall pattern looked like this:
- the Orchestrator delegates the analysis to the Code Analyzer;
- the Analyzer queries the graph to identify modules, classes, dependencies, and high-complexity components;
- the Analyzer returns a dependency-aware conversion plan;
- the Orchestrator launches five Builder tasks in parallel for the leaf modules: `error`, `utils`, `storages`, `operations`, and `queries`;
- once those builders finish, the Orchestrator launches the dependent modules: `table`, `database`, and `middlewares`, plus `lib.rs` and `Cargo.toml`;
- the Verifier reviews the generated output;
- the Orchestrator writes the final code and report artifacts.
You can see the parallel dispatch of the leaf modules clearly in the logs:
[tool:start] task | builder_agent: "Generate error.rs"
[tool:start] task | builder_agent: "Generate utils.rs"
[tool:start] task | builder_agent: "Generate storages.rs"
[tool:start] task | builder_agent: "Generate operations.rs"
[tool:start] task | builder_agent: "Generate queries.rs"
[agent:start] BuilderAgent (x5)

The generated output
The pipeline produced a complete Rust project with 9 source files, 4,081 lines of code, 147 tests, and a couple of examples:
tinydb-rs/
  Cargo.toml
  README.md
  src/
    lib.rs
    error.rs
    utils.rs
    storages.rs
    queries.rs
    operations.rs
    table.rs
    database.rs
    middlewares.rs

One example of the generated design is the generic TinyDB<S: Storage> API:
use crate::error::Result;
use crate::storages::Storage;
use crate::table::{Document, Table, UpdateFields};

pub const DEFAULT_TABLE_NAME: &str = "_default";
pub const DEFAULT_CACHE_SIZE: usize = 10;

pub struct TinyDB<S: Storage> {
    storage: S,
    default_table_name: String,
    cache_size: usize,
}

impl<S: Storage> TinyDB<S> {
    pub fn new(storage: S) -> Self { /* ... */ }

    pub fn insert(&self, document: &Value) -> Result<u32> {
        self.default_table().insert(document)
    }

    pub fn search(&self, query: &dyn QueryLike) -> Result<Vec<Document>> {
        self.default_table().search(query)
    }
}

Multi-agent vs vibe coding
To see whether all this extra structure actually produced a real advantage, I compared the multi-agent output with a much simpler baseline: a single coding-assistant run using the same generation model as the multi-agent Builder.
The baseline prompt was as direct as possible:
Convert the TinyDB Python package to Rust.
To keep the comparison fair, that single-shot run used the same model as the multi-agent Builder: Claude Sonnet 4.5.
I compared the two outputs in two ways:
- with an assessment delegated to Opus 4.6 (reasoning max);
- with manual code inspection, to validate the main architectural points and the most important bugs.
I do not consider this a scientific or independent benchmark. It is a practical experiment on a specific codebase, useful for understanding patterns and trade-offs, not for declaring a universal winner.
Codebase metrics
| Metric | Multi-Agent | Vibe-Coded | Delta |
|---|---|---|---|
| Lines of code | 4,081 | 1,937 | 2.1x more |
| Total tests | 147 | 39 | 3.8x more |
| Query operators | 14 | 9 | 56% more |
| Update operations | 11 | 6 | 83% more |
| Doc-tests | 44 | 9 | 4.9x more |
| Working examples | 2 | 0 | Multi-agent only |
From this angle, the multi-agent result was clearly more complete.
Architectural differences
The differences were not just quantitative. There were also two fairly sharp design choices.
The first concerns the storage abstraction model.
The multi-agent version used generics:
TinyDB<S: Storage>

The vibe-coded version used dynamic dispatch with synchronization wrappers:

Arc<RwLock<Box<dyn Storage>>>

The generic solution is more idiomatic in Rust and, as you would expect, performed better in read-heavy paths thanks to monomorphization. The dynamic-dispatch version is easier to compose and, in these tests, turned out to be faster on writes.
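The two designs can be reduced to a small self-contained sketch. The trait and types here are illustrative stand-ins, not the generated code:

```rust
use std::sync::{Arc, RwLock};

trait Storage {
    fn load(&self) -> String;
}

struct MemoryStorage;

impl Storage for MemoryStorage {
    fn load(&self) -> String {
        "{}".to_string() // an empty JSON database
    }
}

// Multi-agent style: generic parameter, monomorphized at compile time.
struct DbGeneric<S: Storage> {
    storage: S,
}

impl<S: Storage> DbGeneric<S> {
    fn read(&self) -> String {
        self.storage.load()
    }
}

// Vibe-coded style: trait object behind Arc<RwLock<...>>, dynamic dispatch.
struct DbDyn {
    storage: Arc<RwLock<Box<dyn Storage>>>,
}

impl DbDyn {
    fn read(&self) -> String {
        self.storage.read().unwrap().load()
    }
}

fn main() {
    let g = DbGeneric { storage: MemoryStorage };
    let d = DbDyn { storage: Arc::new(RwLock::new(Box::new(MemoryStorage))) };
    // Same observable behavior, different dispatch and locking costs.
    assert_eq!(g.read(), d.read());
    println!("both designs load: {}", g.read());
}
```

The generic form fixes the storage type per database instance; the trait-object form pays a vtable call and lock traffic on every access but lets callers swap storages at runtime.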
The second major difference concerns concurrency:
- the multi-agent version used `RefCell` internally and then marked some types as `Send` and `Sync` with `unsafe impl`, which is clearly wrong and led to a critical unsafety bug;
- the vibe-coded version used `RwLock`, a much more reasonable primitive from a thread-safety perspective, but its implementation deadlocked inside `insert_multiple`.
So neither version was bug-free. They just failed in different ways.
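The RwLock failure mode is easy to reproduce in miniature: if a method holds the write lock and then calls a helper that tries to acquire it again, the second acquisition can never succeed on the same thread. The sketch below shows the shape with try_write, since a blocking write() would simply hang. The RefCell mistake is the mirror image: RefCell is not Sync, so the only way to mark the containing types Send and Sync is an unsound unsafe impl, which cannot be demonstrated in running code because it only fails under actual data races.

```rust
use std::sync::RwLock;

fn main() {
    let table = RwLock::new(Vec::<u32>::new());

    // insert_multiple takes the write lock...
    let mut guard = table.write().unwrap();
    guard.push(1);

    // ...then calls a helper that tries to take it again. A blocking
    // write() here would hang forever; try_write makes the failure visible.
    assert!(table.try_write().is_err());

    // Once the first guard is dropped, the lock is free again.
    drop(guard);
    assert!(table.try_write().is_ok());
    println!("re-entrant acquisition of a held RwLock never succeeds");
}
```

std::sync::RwLock is not reentrant, so any internal helper that locks must receive the already-acquired guard instead of locking again.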
Main bugs found
| Issue | Multi-Agent | Vibe-Coded | Severity |
|---|---|---|---|
| Unsound unsafe impl Send/Sync on RefCell | Yes | No | Critical |
| Deadlock in insert_multiple | No | Yes | Critical |
| Silent integer-to-float coercion | No | Yes | High |
| Panic on invalid regex | No | Yes | High |
Performance benchmarks
The write-heavy benchmarks favored the vibe-coded version, with an advantage close to 2x:
| Benchmark | Multi-Agent | Vibe-Coded | Winner |
|---|---|---|---|
| Bulk Insert (1000) | 796.76 us | 436.20 us | Vibe-coded |
| Update Query (2000) | 3.31 ms | 1.83 ms | Vibe-coded |
The read-heavy benchmarks, by contrast, favored the multi-agent version:
| Benchmark | Multi-Agent | Vibe-Coded | Winner |
|---|---|---|---|
| Read All (2000) | 1.01 ms | 1.52 ms | Multi-agent |
| Search Eq (2000) | 112.66 us | 120.75 us | Multi-agent |
If I looked only at the raw benchmark scoreboard, I could say the vibe-coded version “wins” 6-4, mostly because the write-heavy tests carry a lot of weight.
But stopping there would be misleading. The multi-agent output came with more tests, broader API coverage, executable docs, and fewer correctness issues. Its most serious flaw is concentrated in a single wrong concurrency decision. The vibe-coded version is leaner and faster on writes, but it would require substantial work to recover the missing features and close the correctness bugs.
A surprise about thread safety
There was one passage that made me change my mind while I was analyzing the results.
At first glance, the vibe-coded version looked simply better on the conceptual level: RwLock is a sensible Rust primitive, while RefCell plus unsafe impl Send/Sync is an obvious mistake.
But then I took a closer look at the documentation of the original Python library and noticed an important detail: TinyDB was not designed for thread-safety. In fact, the documentation explicitly lists “access from multiple processes or threads” among the reasons not to use TinyDB.¹
In the original project there is no real synchronization strategy. The core read-modify-write flow is unprotected, and concurrency is essentially out of scope.
This means that:
- the multi-agent version, despite the `unsafe` mistake, was actually closer to the original design intent, because it remained conceptually single-threaded;
- the vibe-coded version introduced a stronger concurrency model, which was a requirement not present in the original version.
While the thread-safe approach is certainly desirable, the result is that the multi-agent version was more faithful to the original design. The lack of specifications in the “vibe” approach gave the agent a much wider exploratory space, in which it made choices that were sound, but divergent from the original project.
Lesson learned: a port should not be evaluated only on what the target language makes possible, but also on what the source project actually intended to be.
What I take away from this experiment
Rather than declaring multi-agent the winner, this experiment clarified six things for me.
1. Structural facts should be extracted with deterministic tools
If the AST can tell you which modules import what, which classes inherit from which bases, and which functions call each other, there is little point in paying expensive tokens to make an LLM infer it.
2. Model choice should be made per role, not by fashion
The Analyzer can be small and cheap because it reads structured data. The Builder is where a stronger model actually creates value. Using the same “top” model everywhere is often just waste.
3. Making the dependency graph explicit unlocks real parallelism
Once dependencies are explicit, the Orchestrator can build independent modules in parallel, starting from leaf nodes and then moving up toward more central modules. That is an advantage a single-shot conversion, by definition, cannot have.
This benefit would probably become even more visible on codebases much larger than TinyDB.
4. Verification is not optional
Compiler checks, semantic review, and tests are not a luxury. They are the minimum needed to distinguish a plausible port from a reliable one.
5. Documentation matters almost as much as code
When it exists, Context7 is probably the most practical route. When it does not, a local vector store remains the universal fallback, especially in enterprise settings or on internal frameworks.
6. Reducing the search space improves reliability, but does not guarantee the optimum
The multi-agent approach narrowed the model’s room to maneuver and pushed it toward a more conservative solution that stayed closer to the original design. The vibe-coded version explored more freely. It dropped many features and made serious mistakes, but in a few places it also made bolder decisions that were locally more effective for Rust.
So no: more control does not automatically mean better design in the absolute sense. It mostly means output that is easier to govern, verify, and reason about.
Limits of the experiment and next steps
This project is still exploratory, so it is worth being explicit about what it does not show: it does not show that multi-agent is always better than single-shot, that this pipeline is ready for any Python codebase, or that orchestration cost is already optimized.
What it does show, at least in this case, is that adding structure before generation leads to output that is more complete and closer to the original design.
The next steps that make the most sense to me are these:
- larger codebases: repeat the experiment on something much bigger than TinyDB;
- tighter verify-revise loops: run the Verifier after each module instead of only once at the end;
- more target languages: the framework already supports multiple compilers and should be tested beyond Rust;
- lower orchestration cost: most tokens were consumed by the Orchestrator, so there is room to reduce coordination overhead.
The full source code, benchmark report, execution trace, and generated Rust crate are available in macc.
If you are running similar experiments on code conversion, this is the comparison I care about: not “which prompt works better,” but which architecture actually reduces the ambiguity of the problem.
¹ TinyDB documentation, “Why Not Use TinyDB?”, which explicitly lists access from multiple processes or threads as a non-goal: https://tinydb.readthedocs.io/en/latest/intro.html