By Nikolay Popov,
Senior .NET Software Engineer, Intellectsoft
There’s a line in Ethan Mollick’s Co-Intelligence that I keep thinking about. He’s a Wharton professor and his observation is this: most people who try AI and decide it’s not useful tried it wrong. They handed it a task, got a mediocre result, and concluded the technology was overhyped. What they missed is that AI doesn’t slot into broken workflows and fix them. It intensifies them.
That describes almost every failed AI project I’ve been close to. And I’ve been close to enough of them to know the pattern.
Three years ago I was a .NET developer. Someone suggested switching to Python. I figured it would look good on my portfolio, so I said yes. Then the first AI task landed on my desk, and my thinking about software changed in a way I didn’t fully expect.
Building an Iterative Research Pipeline
The task was to build a private deep research system over internal company documents – not a simple RAG pipeline, though people often confuse the two. A basic RAG retrieves chunks and generates an answer in a single pass. What we built was iterative and agentic, grounded in research that was emerging at the time around retrieval-augmented generation, chain-of-thought reasoning, tool use, multi-step planning, and web-based QA systems.
This was also a different era in terms of model capabilities. Context windows were eight thousand tokens. There was no support for millions of tokens in a single call. And we were working over internal corporate documents – not the open web. A real enterprise can accumulate millions of internal documents over years of operation. You can’t fit that into a context window. You have to build a system that knows how to search, evaluate, and iteratively refine.
The system started by clarifying intent – understanding what the user actually needed before touching the knowledge base. From there it planned the retrieval strategy, ran hybrid search with semantic and keyword matching, reranked and compressed the results, then evaluated whether what it found was sufficient to produce a grounded answer. If it wasn’t – and often it wasn’t, because real research questions rarely resolve in one pass – the system reformulated the query, retrieved additional context, embedded new information on the fly, and ran the cycle again. Each iteration refined both the query and the accumulated evidence.
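As a rough illustration, the control flow of that cycle looks something like this – every helper here is a toy stand-in for a real component (intent clarification, hybrid search, reranking, the sufficiency evaluation), not the actual implementation:

```python
# Toy sketch of the iterative research loop. Helpers are stand-ins,
# just to make the cycle concrete.

def deep_research(question, search, max_iterations=5):
    """search: callable(query) -> list of evidence strings."""
    query = question.strip()                    # stand-in for intent clarification
    evidence = []
    for _ in range(max_iterations):
        hits = search(query)                    # hybrid semantic + keyword retrieval
        for h in hits:                          # rerank/compress stand-in: dedupe
            if h not in evidence:
                evidence.append(h)
        if len(evidence) >= 2:                  # stand-in for the sufficiency check
            return {"answer": " ".join(evidence), "resolved": True}
        query += " (refined)"                   # stand-in for query reformulation
    return {"answer": None, "resolved": False}  # still ambiguous: hand back to the user
```

If the loop exhausts its iterations without enough grounded evidence, it returns unresolved rather than guessing – that unresolved state is where escalation to the user happens.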
Human-in-the-Loop as a Design Principle
When the system hit genuine ambiguity – a question it couldn’t resolve with the available data – it escalated to the user rather than guessing. Human-in-the-loop wasn’t an afterthought; it was a designed transition point in the pipeline.
What Made It Different
This was roughly six months before ChatGPT or anything comparable. By the time everyone else was building single-pass RAG, we had already learned why a single pass isn’t enough for real research workflows over large document collections.
Since then I’ve worked across diverse AI automation projects: a clinical document processing system for a medical client in Japan and a natural-language query interface that replaced a manual GraphQL workflow. Each project taught me something different about where these things tend to break, and what makes the difference between a system that reaches production and one that never makes it past the demo.
Case Study: Clinical Document Processing in Japan
This is the project that best illustrates how deceptively complex “simple” automation actually is.
What the workflow looked like before
A patient arrives at a clinic carrying a printed lab report from a diagnostic facility. The doctor takes that printout, opens their software, and manually transfers every test result in, one value at a time. Then they look at all those values and manually write up what they see – a brief summary. It works. It’s also slow, error-prone, and capped at whatever patterns the doctor can hold in their head.
What we built
Extraction from known providers relied on templates – mappings from a provider’s fixed layout to the values we needed. A practical question came up from the business side: why build these provider templates by hand if it takes time and resources? So we built a separate agent that generates templates automatically from an existing document. Feed it a sample lab report from a new provider, and it maps the structure itself. Onboarding a new diagnostic provider went from a manual task that took days to something that ran on its own.
The bigger challenge, though, wasn’t the AI – it was that “a document” isn’t one thing. We had to handle five fundamentally different input types, each requiring its own extraction strategy.
Plain images – photos of printed reports, sometimes taken at an angle, sometimes poorly lit. Pure OCR territory with all the noise that comes with it.
Unstructured PDFs – machine-generated, but with no predictable layout. We know it’s a lab report, but we don’t know where any specific value lives on the page. Every provider arranges the same tests differently.
Structured PDFs – documents from known providers where elements are in fixed, predictable positions. A template maps directly to the layout, and extraction is precise.
OCR PDFs – the tricky category. These look structured but were generated by an external system through an unknown process. The internal representation is unreliable, so we can’t trust the PDF layer even though we know what the document should contain. We treat these as images that happen to be wrapped in a PDF container.
Audio – sometimes as a supplement to a PDF, sometimes embedded in the file itself. A doctor’s voice notes accompanying the lab results. Audio notes are transcribed and cross-referenced with the extracted lab values – if a doctor dictates “blood sugar looks elevated,” the system validates that against the actual reading from the report.
The system classifies each incoming document first, then routes it through the appropriate extraction pipeline. For critical values, we run redundant extraction – multiple independent strategies on the same document – and cross-validate. Why not a single pipeline with confidence scoring? Because a single method’s confidence score can’t catch its own systematic errors. If OCR misreads “1.4” as “14,” it will report high confidence in the wrong value. An independent extraction path catches that. If two methods agree on a value, it enters the system automatically. If they disagree, it goes to a review queue. The computational cost of redundancy is real, but in clinical data, a silent error isn’t a performance issue – it’s a patient safety issue.
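A minimal sketch of that agreement check – the function name and zero tolerance are illustrative, not the production interface:

```python
# Cross-validation of one critical value extracted by two independent
# strategies: agreement enters the system automatically, disagreement
# goes to a human review queue.

def cross_validate(value_a, value_b, tolerance=0.0):
    """Compare two independently extracted numeric readings."""
    if abs(value_a - value_b) <= tolerance:
        return {"value": value_a, "status": "accepted"}
    # A silent wrong value is a patient safety issue, so never pick one.
    return {"value": None, "status": "review", "candidates": (value_a, value_b)}

cross_validate(1.4, 1.4)   # agreed -> enters the system automatically
cross_validate(14.0, 1.4)  # the OCR "1.4 vs 14" case -> review queue
```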
After extraction and normalization, the system generated what we internally called a hypothesis layer – essentially a differential diagnosis support tool. It surfaced the conditions most consistent with the patient’s combined biomarkers, ranked by likelihood given their age and profile. The doctor reviewed, agreed or disagreed, and investigated anything unfamiliar. The Japanese team continued building on this, grouping certain tests together, looking at how blood pressure and blood cholesterol interact and what that combination might suggest given the patient’s age.
Why this matters beyond speed
Doctors absorb an enormous amount in their first decade of practice. Then the knowledge starts to erode. Not from negligence – staying current while seeing patients every day simply isn’t realistic. Researcher Peter Densen has documented that the half-life of medical knowledge is roughly five years. The BMJ has published work showing it takes around seventeen years for new clinical evidence to reach practicing doctors. Seventeen years. That’s not a people problem. That’s a broken pipeline between research and the clinic.
The system sidesteps that problem entirely. It has no knowledge decay. It can show the doctor a combination of biomarkers that they have never seen before, prompt them to look it up, and lead to a referral that wouldn’t have happened otherwise.
The doctor sees: oh, this patient may have this condition. I’ve never even heard of it. They look it up. Right, these indicators – this could indeed be real. I need to look into this further, bring in a specialist, and refer the patient. That isn’t a replacement. That’s the doctor becoming more capable than they were yesterday.
That’s what good clinical automation does. It doesn’t replace the doctor’s judgement. It expands what the doctor can see.
Case Study: From GraphQL to Natural-Language SQL
This one was a bit bittersweet.
What I built originally
The company used GraphQL as a standard interface for accessing analytical data across multiple products. It was a good fit for structured queries with known schemas. I built the auto-generation layer: the system inspected the database at startup using C# with reflection and expression trees, built classes in memory, generated the full GraphQL schema – types, filters, sub-filters for child tables, pagination – and exposed a complete API without a single hand-written resolver. For a database with a hundred-plus tables, the schema was ready in seconds.
Expression trees handled one level of nesting cleanly – filtering on a parent table’s child relationships compiled down to efficient SQL joins. Beyond one level, the N+1 problem became unavoidable within the expression tree approach. That’s a real constraint we had to design around rather than pretend didn’t exist.
There was also a real memory management challenge. When your types, filters, and query objects are all generated in memory at runtime, the garbage collector doesn’t distinguish between objects you’re still using and the class definitions that created them – both are just allocations. We had to be deliberate about lifecycle management to keep the memory footprint stable under load.
How the system evolved
The system evolved in three phases.
First: analysts submitted requests, a developer manually wrote GraphQL schemas, built the queries, returned results. Turnaround: days.
Second: my auto-generation layer eliminated the manual schema work – hundreds of tables, full filter trees, no developer in the loop for schema maintenance. That was a genuine improvement. But the descriptions and business context around each field still required manual upkeep, and as the underlying data structures changed, that maintenance became the new bottleneck.
Third: we replaced the translation layer entirely with an LLM interface. An analyst types a question in simple English. The system converts it to SQL, runs it, returns a result. No schema maintenance at all. No developer in the middle.
Why we moved past GraphQL
Before jumping to Text-to-SQL, we explored an intermediate step: having the LLM analyze the GraphQL schema and generate GraphQL queries directly. It worked for simple lookups, but we hit the ceiling quickly. GraphQL is an abstraction layer over the database, and that abstraction limits what you can express. Complex joins, nested aggregations, conditional subqueries – the kind of analytical questions real users ask – either aren’t possible in GraphQL or require custom resolvers written by hand, which defeats the purpose of automation.
There was also a practical consideration: the system needed to integrate across multiple clients with different database structures. GraphQL schemas are tightly coupled to their specific data model. SQL is a universal language that works across all of them. Removing the intermediate layer and going directly to SQL gave the LLM access to the full expressive power of the database. That was the real unlock – not replacing GraphQL for the sake of it, but recognizing that for free-form analytical access across diverse data sources, the abstraction was creating a ceiling, not providing value.
The story didn’t end there. As the architecture matured, we moved to MCP – Model Context Protocol – where each data access method became a separate server that the agent can call as a tool. The Text-to-GraphQL approach that we initially abandoned didn’t disappear; it found its proper place as one capability inside an MCP server. GraphQL works well for structured, predictable queries against a known schema – it was never the right tool for free-form analytics, but it’s a perfectly good tool for its original purpose. Now the agent decides which tool fits the question: direct SQL for complex analytical queries, GraphQL through MCP for structured data access, OpenSearch for full-text search across document stores. What started as hardcoded elements in a monolithic graph evolved into independent agents, each specialized for its data source. The system got smarter not by replacing tools, but by learning when to use which one.
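As a toy illustration of that routing decision – in the real system the agent itself makes the call; the keyword rules below are invented stand-ins:

```python
# The agent picks a data access tool per question. A keyword stub
# plays the role of the LLM's routing judgment here.

TOOLS = {
    "sql": "direct SQL for complex analytical queries",
    "graphql": "GraphQL via MCP for structured, known-schema access",
    "opensearch": "full-text search across document stores",
}

def route(question):
    q = question.lower()
    if any(w in q for w in ("average", "trend", "group", "join")):
        return "sql"            # free-form analytics -> full SQL expressiveness
    if any(w in q for w in ("find text", "mentions", "search documents")):
        return "opensearch"     # unstructured full-text lookup
    return "graphql"            # structured, predictable queries

route("average order value by region, monthly trend")  # -> "sql"
```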
Safety and validation
When an LLM generates SQL, the obvious question is: what stops it from generating something destructive? Every generated query runs through validation – the system rejects mutations, limits scope to read-only operations, and logs cases where the returned data doesn’t match the analyst’s stated intent. That feedback loop is what drives continuous improvement in prompt accuracy.
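A sketch of what such a guard can look like – one layer only; a production setup would also enforce a read-only database role. The regex and rules here are illustrative, not our exact validator:

```python
# Reject any generated statement that isn't a plain read-only query.
import re

FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|truncate|grant|create)\b", re.I)

def validate_sql(query: str) -> bool:
    stmt = query.strip().rstrip(";")
    if ";" in stmt:                               # no stacked statements
        return False
    if not re.match(r"(?i)\s*(select|with)\b", stmt):
        return False                              # must start as a read
    return not FORBIDDEN.search(stmt)             # no mutating keywords anywhere

validate_sql("SELECT name FROM users")        # True
validate_sql("DROP TABLE users")              # False
validate_sql("SELECT 1; DELETE FROM users")   # False
```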
The cost reality
Every query is an API call. At the demo level, that cost is invisible. As usage grew, it became a real constraint – one we hadn’t budgeted for. The detailed cost story is in the next section, but the short version: we had to learn cost optimization in production, not in planning.
But here’s what we got in return: any engineer on the team can maintain it. From a business lens, if you need to deploy something quickly and you need analysts to make judgments based on real data instead of waiting days for a query, the economics work.
Named debt vs invisible debt
The GraphQL system was brilliant engineering with no exit plan. Nobody asked: what happens when the team that built this leaves? That’s exactly what happened. The LLM replacement was a deliberate tradeoff. We knew what we were giving up and what we were getting. We named the debt before we took it on. That’s the difference that matters.
I built something technically sophisticated – runtime code generation, expression trees, memory-safe dynamic typing – and then led the effort to replace it with something simpler. The replacement is easier to maintain, cheaper to staff, and accessible to anyone on the team. Sometimes the right architectural decision means walking away from your own best work.
How We Actually Use AI Day to Day
AI has changed how we work across the entire development cycle. Here’s what that actually looks like in practice.
Context recovery
In consulting work, developers rotate between projects based on priority. You leave a codebase for a month or two, come back, and the context is gone. Before AI tooling, regaining that context meant three to four hours of navigating code, rereading tickets, tracing dependencies. Now it takes thirty minutes to an hour. The AI has the full codebase in context – you ask it what changed, what the current architecture looks like, where the last work left off, and you’re back up to speed. Over a year, across a team that rotates regularly, that’s hundreds of engineering hours recovered.
Drafting code from specs
We write formal specifications – inputs, outputs, constraints, edge cases – and the LLM generates the first working implementation. The developer’s job shifts from writing boilerplate to reviewing whether the generated code matches the spec. First drafts that used to take a day are ready for review in minutes.
Architecture conformance
We maintain reference documents describing system architecture, component responsibilities, logging standards, naming conventions. An automated process checks every pull request against those documents and flags drift – a method named wrong, a component used outside its intended scope, a logging pattern that doesn’t match the standard. Large codebases stay consistent without one person having to hold the entire system in their head. In consulting, where teams rotate and projects get handed off, this is the difference between a codebase that degrades over time and one that holds its shape.
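A toy version of one such check – the naming rule here (async methods must end in `_async`) is invented purely for illustration; the real checks run every pull request against the full reference documents:

```python
# Flag naming drift against a (hypothetical) project convention.
import re

def check_naming(source: str):
    """Flag async defs that don't follow the *_async suffix rule."""
    violations = []
    for match in re.finditer(r"async\s+def\s+(\w+)", source):
        name = match.group(1)
        if not name.endswith("_async"):
            violations.append(name)
    return violations

check_naming("async def fetch_users(): ...")        # -> ["fetch_users"]
check_naming("async def fetch_users_async(): ...")  # -> []
```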
Refactoring and migration
When migrating legacy code, the LLM handles the mechanical translation – converting patterns, updating APIs, rewriting syntax – while the developer focuses on the parts that require judgment: whether the migration changes behavior, whether edge cases are preserved, whether the new version actually improves on the old one.
Planning system changes
Before anyone writes a line of code, we describe the proposed change in plain language and have the LLM walk through the implications: what components are affected, what interfaces change, where the risks are. It’s not a replacement for architecture review – it’s preparation for it. The architect walks into the conversation with a first draft of the impact analysis already done.
Log analysis
In the test environment, the LLM scans logs automatically after each deployment, identifies anomalies, groups related errors, and produces a summary. Instead of a developer scanning thousands of lines looking for what went wrong, they get a structured report of what’s different from the last clean run. Debugging starts from a hypothesis instead of from raw noise.
Documentation
Documentation updates happen as part of the development cycle, not after it. When a component changes, the LLM generates an updated description based on the new code and the existing docs. The developer reviews and approves. Documentation that used to lag weeks behind the code now stays within a PR of current.
Test generation
You write a document describing what a feature should do – the expected behaviors, the edge cases, the failure modes. The LLM generates the test scaffolding: the setup, the assertions, the structure. The developer reviews, adjusts, and fills in the cases that require domain knowledge. The repetitive boilerplate is handled. That time goes back to writing the tests that actually catch bugs.
Pull request summaries
Before: a developer changes ten files and writes “updated code.” Now: the model reads the diff and produces a clear summary – what changed, why, what it affects. Reviewers start the review understanding the intent, not reverse-engineering it from the diff. Code reviews happen properly instead of getting waved through because nobody had time to understand the change.
The real shift
We moved from writing prompts to writing specifications. A prompt is a wish. A spec is a contract. Formal specs produce predictable, repeatable results. That’s what makes AI reliable at scale – not clever prompting, but engineering discipline applied to a new tool.
Seven Days to Ship. Six Months to Fix. A Demo Is Not a Product
When large language models became widely available, there was an expectation across the industry that development costs would drop dramatically. And in a sense, they did – demo projects started appearing at a speed nobody had seen before. But speed created its own problems.
The Speed Trap
A demo built with LLM-assisted code can look impressive and fail silently. In roughly one out of every four runs, we’d see outputs that looked plausible but were completely wrong – hallucinations that only surface when you check the results against real data. Many of these demo products also don’t scale, because the LLM generates code for the immediate request: here’s some data, process it, return a result. It solves the problem in front of it. It doesn’t think about what happens when the data is ten times larger, when three services need to coordinate, or when the same logic needs to work across different environments.
There’s a subtler problem that took longer to see. LLM-assisted development lets each developer cover more ground individually. That sounds like pure upside, but it means teams spend less time in architectural discussions – everyone is heads-down producing more code to cover more of the client’s requirements. More code, less coordination. The codebase grows faster than the team’s shared understanding of it.
The code itself becomes harder to read. When different blocks, classes, and modules are generated in separate sessions with separate prompts, each piece follows its own patterns and conventions. One class uses one approach to error handling; the next class uses a completely different one. Without strict specifications – not just for the application, but for each module, each class, each interface – the codebase drifts into inconsistency that compounds over time. More code gets written, but quality drops and scalability drops even further.
That’s why our investment in specifications, architecture conformance checks, and documentation automation isn’t optional tooling – it’s a direct response to the problems that LLM-accelerated development creates when it’s used without discipline.
Speed Is Still a Real Requirement
Moving fast is still an actual requirement. A proof of concept released in seven days that starts a conversation is more useful than a perfect system that ships on day ten and misses the window. We know that, and we build that way.
I get told this a lot: “We don’t need a complex system, just make it fast. If we have a week or ten days, we need it in seven.” The deal is: we’ll find a month afterward to rewrite the parts that could have been done properly with three extra days. So there are things we deliberately cut corners on, and then we go back and fix them properly later. But usually there is time – if the product finds a buyer, time eventually becomes available to do the refactoring properly.
The Half of the Decision Nobody Names
What three years taught us is that the speed decision is only half the work. The other half is being upfront about what fast costs: which parts of the architecture were simplified to hit the timeline, where the debt is sitting, what it takes to move from a working demo to something production-ready. When those questions are addressed early, the refactoring gets planned and budgeted. When they aren’t, the debt shows up later as a surprise. And surprises in software are always more expensive than the original estimate.
The work we’re most proud of is the work where the speed was deliberate and the tradeoffs were visible from the beginning. That’s what turns a prototype into a product.
Then the Bill Arrived
About six months into applying AI heavily, we hit our first cost wall. The system was working well and getting more expensive with every request. We switched from asking “how can we build more with AI?” to “how can we use fewer tokens without destroying what we’ve built?”
Every LLM API call has a price. At the demo level, that price is invisible. At production scale, it becomes a core architectural constraint. We started asking: can we cache semantically similar queries so we don’t hit the API twice for the same intent? Can we route simpler questions to a smaller, cheaper model and reserve the expensive model for complex ones? Can we restructure prompts to use fewer tokens without losing accuracy? These are the questions that determine whether a proof of concept can survive contact with a real workload.
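A sketch of the first two levers, with toy stand-ins: the “semantic” cache key here is a crude word-order normalization (a real cache compares embeddings), and the complexity test is a bare word count. Everything here is illustrative:

```python
# Two cost levers in one sketch: a query cache keyed on normalized
# intent, and routing simple questions to a cheaper model.

CACHE = {}

def normalize(query):
    # Crude stand-in for semantic similarity: ignore word order and case.
    return " ".join(sorted(query.lower().split()))

def answer(query, call_cheap, call_expensive, complexity_threshold=30):
    key = normalize(query)
    if key in CACHE:                       # same intent seen before: no API call
        return CACHE[key]
    # Reserve the expensive model for long, complex questions.
    model = call_cheap if len(query.split()) < complexity_threshold else call_expensive
    CACHE[key] = model(query)
    return CACHE[key]
```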
You can build a fantastic system, and then discover you’d need a nuclear power plant just to run it. Not an exaggeration. It’s a budget conversation nobody had when they were designing the thing.
The companies that are furthest along in AI automation are the ones who modeled the operating cost of their systems before they scaled them, and built efficiency into the architecture from the beginning rather than retrofitting it later.
What the Teams That Get It Right Do Differently
After three years of building AI systems, I’ve seen the same pattern enough times to call it predictable. Someone builds fast, skips the architecture conversation, accumulates debt nobody planned to repay, hits the cost wall at scale, and then discovers the system that worked in the demo doesn’t work in the real world. At that point, fixing it costs more than building it right the first time would have. The “fast and cheap” approach doesn’t save money. It moves the bill to a later date, with interest.
The model is almost never the reason projects fail. It’s the choices made before the model gets involved: scope that’s too vague, data that’s not in the right places, cost structures that don’t survive scaling, timelines that create debt nobody planned to repay.
The teams that get this right start with a question: what does this workflow need? They have someone in the room whose job is to ask the structural questions – how data flows through the system, how costs will behave at scale, what happens when the model is wrong – before those questions become expensive to answer. They have someone translating between what the business needs and what the engineers are building, because without that translation, you’re guessing. And guesses in software have a way of becoming expensive commitments.
Every shortcut needs to be named. Every place the architecture was simplified to hit a timeline needs to be documented. Because when the client decides to move from demo to production, there should be a planned roadmap of what needs to be hardened. You budget for the refactoring because you had that conversation before anyone opened a code editor.
What This Looks Like from the Inside
Most people who struggle with AI automation find that the hardest problems were never technical. Mollick puts it well: AI makes broken processes impossible to ignore.
The companies that get to production – with something that runs at scale without eating the budget – are the ones who treated the architecture conversation as part of the build, not an afterthought. They knew the cost structure before it surprised them. They named the debt before it named them.
We’ve spent three years inside exactly those problems, in healthcare and analytics. Building, breaking things, going back to fix them. We know where these systems fail because we’ve been the ones fixing them.