Being an AGI Believer, Not a Skeptic - Lessons from Augment Code
It has been just over three years since ChatGPT launched in November 2022. In that time, progress in transformers, large language models, and AI systems has felt less like three years and more like a decade.
Augment Code was a concentrated lesson in what it takes for AI research to compound, and what happens when it doesn't. Shortly after launching the initial Gemini at DeepMind, I joined Augment Code as a founding researcher to work on AI for coding. The thesis was compelling: coding is one of the largest AI markets, with a ceiling effectively set by the total compensation of all developers worldwide.
We assembled an unusually strong early team—researchers from Google Brain, DeepMind, and FAIR; experienced systems engineers; and executives with prior startup success. The timing was right, the skill sets were aligned, and the company was well-funded by early-stage standards [1]. Yet today, Augment Code has faded from the competitive landscape.
In hindsight, we made several correct early bets. Many of them simply failed to compound. That gap — between being right and winning — is what this post is about.
A quick clarification: this post is not a judgment on Augment Code's current trajectory. After a painful transition with many early researchers departing, the company repositioned itself as a product-first organization. What follows reflects my own lessons from a specific phase of the company — not a verdict on its future. I continue to wish the team the very best.
Bet #1: Verifiable Signals Were the Right Foundation
By mid-2023, we built verifiable evaluation environments: real repositories across languages, hundreds of tasks per repo, unit tests, and Kubernetes-based sandboxes. These produced ground-truth signals that cleanly distinguished correct from incorrect behavior. We used them as internal benchmarks and, in some cases, explored them as training signals for reinforcement learning.
Notably, this work predated by several months the public release of SWE-Bench [2] in late 2023. In hindsight, this was exactly the kind of signal infrastructure frontier labs would later spend hundreds of millions of dollars acquiring. The mistake wasn't technical; it was strategic. We treated these systems as evaluation assets, not as the core of a scalable reward and data flywheel. Compounding requires not just verifiable signals, but a commitment to scaling and reusing them systematically.
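The core of such a verifiable signal is simple to sketch: run a repository's test suite in an isolated process and map the outcome to a binary reward. This is a minimal illustration, not our actual harness; the real systems wrapped this in Kubernetes sandboxes with per-repo setup.

```python
import subprocess
import sys

def unit_test_reward(repo_dir: str, test_cmd: list[str], timeout: int = 300) -> float:
    """Run a repo's test suite in a subprocess and map the result to a
    binary reward: 1.0 if every test passes, 0.0 otherwise."""
    try:
        result = subprocess.run(
            test_cmd, cwd=repo_dir, capture_output=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return 0.0  # a hanging suite counts as a failure
    return 1.0 if result.returncode == 0 else 0.0

# Example: any command whose exit code reflects test success works here,
# e.g. ["pytest", "-q"] for a Python repo.
```

The value of the signal is that it is ground truth: unlike model-judged scores, a passing test suite cleanly separates correct from incorrect behavior, which is what makes it usable both as a benchmark and as an RL reward.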
Bet #2: Real-Time RAG Mattered — Until It Didn't
In late 2023, we made real-time retrieval over million-line enterprise codebases work in practice. Latency constraints forced us to compress, filter, and tolerate messy context — something models at the time handled poorly.
This gave us a real edge. But by late 2024, user behavior shifted. Agentic workflows tolerated longer latency and favored recall over precision. Architectural constraints that once differentiated us became less relevant. Being early helps—but only if the user paradigm stays stable.
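The precision-over-recall constraint can be made concrete with a toy sketch (function and heuristics are illustrative, not our production retriever): given scored chunks from a million-line codebase, keep only the highest-scoring ones that fit a fixed context budget.

```python
def pack_context(chunks: list[str], scores: list[float], token_budget: int) -> list[str]:
    """Greedily keep the highest-scoring retrieved chunks that fit within
    a fixed token budget, dropping everything else (precision over recall)."""
    ranked = sorted(zip(scores, chunks), key=lambda pair: -pair[0])
    packed, used = [], 0
    for score, chunk in ranked:
        cost = len(chunk.split())  # crude whitespace token estimate
        if used + cost <= token_budget:
            packed.append(chunk)
            used += cost
    return packed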
Bet #3: Global Next Edit Was Powerful — But Slow to Ship
By early 2024, we had a global next-edit model that could suggest coherent changes across files and repositories [3][4]. In spirit, it went beyond local tab completion.
The failure here was not research quality but iteration speed. It took nearly a year to reach production. In competitive AI products, a year is an eternity.
Bet #4: RL from Real Developers Actually Works
In 2024, I focused full-time on reinforcement learning from real developer behavior [5]. We trained models directly from implicit and explicit user feedback and shipped them to production. It worked.
But RL systems don't survive on individual conviction. They require organizational commitment. After I left, this line of research paused. Meanwhile, competitors doubled down and published similar methods a year later [6].
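One way to see how developer behavior becomes a training signal is a preference-pair construction in the style of DPO: accepted and rejected completions for the same context form chosen/rejected pairs. The event schema here is hypothetical and this is not necessarily the method we shipped, just a sketch of the general idea.

```python
from collections import defaultdict

def build_preference_pairs(events: list[dict]) -> list[dict]:
    """Group completion events by prompt, then pair each accepted
    completion with each rejected one from the same context."""
    by_prompt = defaultdict(lambda: {"accepted": [], "rejected": []})
    for e in events:
        key = "accepted" if e["accepted"] else "rejected"
        by_prompt[e["prompt"]][key].append(e["completion"])

    pairs = []
    for prompt, group in by_prompt.items():
        for chosen in group["accepted"]:
            for rejected in group["rejected"]:
                pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs
```

The hard part is not this transformation but everything around it: de-noising implicit feedback, closing the loop to production, and sustaining the pipeline organizationally, which is exactly where the effort stalled.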
Bet #5: Owning the Training and Inference Stack Mattered
By the end of 2023, we had built an in-house LLM training and inference stack. Training ran with solid MFU (model FLOPs utilization), and the inference stack supported low-bit quantization, continuous batching, and speculative decoding, all compatible with our product requirements. Notably, this entire stack was built and operated by a relatively small team. Around the same time, our in-house coding pretraining reached parity with publicly released models such as DeepSeek-Coder [7].
At the time, owning the full stack gave us real leverage: rapid research experimentation and direct control over end-to-end system performance. As frontier research was deprioritized, however, the stack gradually fell behind industry state of the art, and the organization shifted toward API-based models. Like research itself, in-house infrastructure only compounds if the organization commits to using and evolving it.
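To give a flavor of what the inference stack supported: speculative decoding, in its simplest greedy form, has a cheap draft model propose several tokens and the target model keep the longest prefix it agrees with. The toy functions below are illustrative (real implementations verify all drafted tokens in one batched target pass and use probabilistic acceptance).

```python
def speculative_step(draft_next, target_next, prefix: list, k: int = 4) -> list:
    """One round of greedy speculative decoding: the draft model proposes
    up to k tokens; the target model keeps the longest agreeing prefix,
    then emits one token of its own (the correction or a bonus token)."""
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)

    accepted, ctx = [], list(prefix)
    for tok in proposed:
        expected = target_next(ctx)  # in practice: one batched target pass
        if tok == expected:
            accepted.append(tok)
            ctx.append(tok)
        else:
            accepted.append(expected)  # replace the first mismatch
            break
    else:
        accepted.append(target_next(ctx))  # all accepted: free extra token
    return accepted
```

The payoff is that each expensive target-model pass can yield several tokens instead of one, which is why the technique mattered for product latency.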
The Real Tension We Never Fully Resolved
We correctly identified coding as the largest near-term AI market. Spending on AI coding tools went from zero to billions in under three years.
What we never fully resolved was whether we were building:
- a frontier coding model company, or
- a fast-iterating coding assistant product
Both are viable. Doing both halfway is not.
What Being an AGI Believer Actually Means
Being an AGI believer isn't about timelines or optimism. It's about choosing where compounding happens — data, models, systems, or product loops — and committing even when short-term signals disagree.
Skepticism often looks like pragmatism. But in AI, indecision is the most expensive position.
There is no single formula for building a winning AI startup. But I'm increasingly convinced of one thing: you must pick a compounding axis and let it hurt.
References
- Scott Dietzen, Augment Inc. Raises $227 Million, Apr 26, 2024.
- Carlos E. Jimenez, et al., SWE-bench: Can Language Models Resolve Real-World GitHub Issues?, Oct 2023.
- Arun Chaganty & Jiayi Wei, Introducing Next Edit for VSCode, Feb 19, 2025.
- Jiayi Wei & Arun Chaganty, The AI Research Behind Next Edit, Feb 19, 2025.
- Barry (Xuanyi) Dong, Reinforcement Learning from Developer Behaviors, Nov 26, 2024.
- Jacob Jackson, et al., Improving Cursor Tab with online RL, Sep 12, 2025.
- Daya Guo, et al., DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence, Jan 2024.