How We Got Here: A History of AI as Manufactured Expert Judgement

Reuben Bijl

The path from GPT-1 to frontier models, and the commodification and homogenisation of expert knowledge underneath.

Part One: The prediction engine (2018–2020)

In June 2018, OpenAI released GPT-1. It had 117 million parameters and was trained on a corpus of books - around 4.5 gigabytes of text. It could complete sentences. It could maintain limited coherence across a paragraph. It was impressive in the way a very good autocomplete is impressive.

Nobody called it intelligent.

The architecture was a transformer, trained on a single objective: predict the next token, given all the tokens that preceded it. The distinction between predicting and understanding seemed academic at the time. It would not remain so.
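
To make "prediction engine" concrete, here is a toy sketch in Python. It is a bigram count model - a drastic simplification that conditions only on the previous word, where a transformer conditions on the entire context with learned weights - but the job is the same shape: given what came before, emit the statistically likely next token.

```python
from collections import Counter, defaultdict

# A drastically simplified "prediction engine": given the previous word,
# pick the statistically most likely next word. A transformer does the
# same job, but conditions on the whole context with learned weights.

corpus = (
    "the model predicts the next token . "
    "the model completes the sentence . "
    "the sentence continues the pattern ."
).split()

# Count how often each word follows each other word.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def complete(word: str, steps: int = 6) -> str:
    out = [word]
    for _ in range(steps):
        candidates = following.get(out[-1])
        if not candidates:
            break
        # Greedy decoding: always take the most frequent continuation.
        out.append(candidates.most_common(1)[0][0])
    return " ".join(out)

print(complete("the"))  # -> "the model predicts the model predicts the"
```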

GPT-2 followed in 2019, trained on 40 gigabytes of internet text - a dataset called WebText, scraped from links shared on Reddit. It had 1.5 billion parameters. OpenAI was sufficiently alarmed by its outputs that it initially declined to release the model in full, citing risks of misuse. The model could write convincing fake news articles, complete stories with eerie coherence, and impersonate writing styles.

The public conversation at this point was about scale. The assumption, barely examined, was that intelligence lay somewhere further along the scaling curve. Keep scaling, and understanding would emerge.

GPT-3 arrived in 2020 with 175 billion parameters, trained on a dataset of roughly 570 gigabytes of filtered internet text, books, and Wikipedia. It was startling. It could write code, draft essays, translate languages, answer factual questions, and complete almost any text prompt with unsettling fluency.

But it was still, at its core, a prediction engine. It had no notion of truth. No sense of intent. No understanding of who was asking or why. It completed prompts the way a very well-read person might complete a sentence they'd half-heard before: by pattern-matching against an enormous internalised library of human expression.

"The Stochastic Parrot" paper, published in 2021 by Emily Bender, Timnit Gebru, and colleagues, named this precisely. A stochastic parrot: generating statistically plausible sequences of tokens without any underlying model of meaning. The point it made was correct.

GPT-3 was extraordinary, and not yet useful.

Part Two: The missing ingredient (2021–2022)

Here is the thing that changed everything, and that almost nobody talks about clearly.

GPT-3 was frustrating to use. It would complete your prompt rather than answer your question. Ask it something, and it would generate what statistically came after that kind of question - which might be another question, an irrelevant tangent, or a confident hallucination. It had absorbed an enormous amount of human knowledge, encoded as pattern. But it had no idea what you wanted.

The gap was judgement.

The model had no way of knowing which of its many possible completions was the better one. Better for you, in your context. That kind of evaluation requires something the model fundamentally lacked: a model of what humans consider good.

In early 2022, OpenAI published the "InstructGPT" paper. The technique itself wasn't new: Christiano and colleagues had used reinforcement learning from human feedback to train simulated agents in 2017, and Stiennon and colleagues had applied it to text summarisation in 2020. What "InstructGPT" did was scale and productise the method. The recipe was deceptively simple in concept, enormously expensive in execution.

They hired human annotators.

These annotators were given prompts - the kinds of things people actually asked GPT-3 - and asked to write what a good response would look like: helpful, honest, appropriately cautious. They were also asked to rank competing model outputs against each other: which of these is better?

Those rankings were used to train a separate model, a reward model, that learned to predict which outputs humans preferred. That reward model then guided the original GPT-3 through a process called reinforcement learning, nudging it toward outputs that scored highly on human preference.

The resulting model was InstructGPT, actually a family of models at 1.3 billion, 6 billion, and 175 billion parameters. The striking result from the paper was that labellers preferred the smallest 1.3B InstructGPT to the original 175B GPT-3. Fewer parameters, less raw capacity, but dramatically more useful. It answered your question. It followed your instructions. It behaved as if it cared about what you wanted.

This process was called Reinforcement Learning from Human Feedback. RLHF.
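
For readers who want the mechanics, here is a minimal sketch of the reward-modelling step, in Python with PyTorch. Everything concrete in it - the dimensions, the random vectors standing in for embedded responses - is illustrative, and real reward models are large transformers scoring full token sequences. The loss, though, is the standard pairwise ranking objective used in the RLHF literature.

```python
import torch
import torch.nn.functional as F

# Toy reward model: a linear layer scoring feature vectors. In practice
# the reward model is a large transformer scoring (prompt, response)
# token sequences; this keeps only the shape of the idea.
torch.manual_seed(0)
dim = 16
reward_model = torch.nn.Linear(dim, 1)
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-2)

# Stand-ins for embedded responses. For each comparison, a human
# labeller ranked `chosen` above `rejected`.
chosen = torch.randn(64, dim) + 0.5
rejected = torch.randn(64, dim) - 0.5

for step in range(200):
    r_chosen = reward_model(chosen).squeeze(-1)
    r_rejected = reward_model(rejected).squeeze(-1)
    # Pairwise ranking loss: train preferred outputs to score higher,
    # i.e. loss = -log sigmoid(r_chosen - r_rejected).
    loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"final loss: {loss.item():.3f}")
# This reward model is then used to steer the base model during
# reinforcement learning, nudging it toward outputs humans preferred.
```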

It's worth being precise about who those labellers were. For InstructGPT, OpenAI used roughly forty contractors hired through Upwork and Scale AI - trained to a rubric, screened for sensitivity to demographic preferences and for spotting potentially harmful outputs, but not domain experts in any conventional sense. Careful generalists, applying a careful generalist rubric.

What nobody said clearly at the time (and what rarely gets said clearly now) is what RLHF actually is.

It is the systematic encoding of human judgement into model weights. At this stage, that judgement was generalist. The shift to expert judgement came next, and it is this shift that explains almost everything that has happened since.

Part Three: The race nobody named (2022–2024)

ChatGPT launched on November 30, 2022. Within five days it had a million users. Within two months it had a hundred million. For about seven months it held the record as the fastest-growing consumer application in history, until Meta's Threads passed 100 million sign-ups in five days in July 2023.

The public narrative was: artificial intelligence has arrived. A superintelligence is coming.

The actual story was different.

What had arrived was a system that had been carefully taught, by thousands of human annotators, what good responses looked like. It had internalised those judgements so thoroughly that it could generate novel outputs that matched them across an extraordinary range of domains.

What the labs raced for, after that, was expert human judgement at scale, more domain-specific, more carefully sourced, and encoded into the model.

This is the part of the story most often missed. Frontier models have kept improving because the input data has become drastically more curated. The training signal moved from "what a careful generalist thinks" to "what a domain expert in this specific field thinks."

OpenAI, Anthropic, Google, and others began building massive expert annotation pipelines. Domain specialists - doctors, lawyers, software engineers, financial analysts, research scientists - rating outputs in their areas of expertise:

  • This response is medically accurate and this one isn't.
  • This legal analysis is sound and this one misses the key precedent.
  • This code is production-quality and this one will break under load.

An entire industry now exists to supply this. The data labelling solutions and services market was valued at around US$18.6 billion in 2024 and is growing at roughly 20 percent a year. Scale AI booked around US$870 million in revenue that year, with about 90 percent of it from generative-AI work. Surge AI, bootstrapped and rarely talked about outside the field, passed a billion US dollars in revenue over the same period. These companies don't exist to collect training data in the old sense. They exist to manufacture expert judgement at scale and pipe it into model weights.

That industry is largely invisible in the public conversation. The Wikipedia entry for Generative Pre-trained Transformer tracks the architectural lineage from GPT-1 forward in careful detail. As of writing, it has no section on the workers doing the labelling. We have paid a lot of attention to datacentre energy and water use - both real, both worth tracking. We have largely missed the contractors in the middle.

The scale is large. Mercor, founded by three twenty-two-year-olds who are now collectively the youngest self-made billionaires in the world, has around 30,000 active professionals on its platform. Scale AI says it has access to more than 700,000 graduates. These are not generalist taggers. They are doctors, lawyers, philosophers, statisticians and software engineers, contracted one project at a time to encode what good looks like in their field.

What made the models more useful was the quality and specificity of the human judgement embedded in them. The models did keep getting bigger. That explains less of the improvement than the headline numbers suggest.

It is also why Meta has struggled to build a competitive frontier model despite spending more than most of its rivals on the attempt - including a roughly fifteen-billion-dollar stake in Scale AI in 2025, an attempt to buy its way into the labelling pipeline. Meta's product DNA is the harvesting of basic human impulses at scale: attention, emotional response, scroll. Curating expert judgement is a different kind of work. The companies that have got further with frontier models so far were built around editorial and research norms in a way Meta never quite was. Whether that is a permanent advantage or a temporary one, I don't know.

Anthropic made this architecturally explicit with Constitutional AI. The 2022 paper described a method: rather than relying purely on individual human raters, the model was trained to critique its own outputs against a short list of stated principles, and to generate synthetic training data aligned with them. Several years of iteration followed.
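
The loop itself is simple enough to sketch. The Python below is schematic: `generate` is a hypothetical stand-in for a call to a language model, and the two principles are illustrative, not quotations from any actual constitution.

```python
# A schematic sketch of the critique-and-revise loop described in the
# 2022 Constitutional AI paper. Nothing here is Anthropic's code; the
# principles are invented for illustration.

PRINCIPLES = [
    "Choose the response that is most helpful, honest, and harmless.",
    "Avoid responses that are misleading or encourage dangerous behaviour.",
]

def generate(prompt: str) -> str:
    """Hypothetical model call - in practice, an API or local inference."""
    raise NotImplementedError

def constitutional_revision(user_prompt: str) -> tuple[str, str]:
    original = generate(user_prompt)
    draft = original
    for principle in PRINCIPLES:
        # 1. The model critiques its own draft against a stated principle.
        critique = generate(
            f"Critique this response against the principle: {principle}\n\n"
            f"Prompt: {user_prompt}\nResponse: {draft}"
        )
        # 2. It then rewrites the draft in light of that critique.
        draft = generate(
            f"Rewrite the response to address this critique: {critique}\n\n"
            f"Prompt: {user_prompt}\nResponse: {draft}"
        )
    # The (original, revised) pairs become synthetic preference data,
    # standing in for part of the human ranking step in plain RLHF.
    return original, draft
```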

In January 2026, Anthropic published the document itself - the constitution Claude is trained against. It runs to around 23,000 words, more if you count the preface and acknowledgements. It is released under CC0 and sits at anthropic.com/constitution. It is not hidden. It is, however, a document most people have never heard of. It determines, at a foundational level, what Claude considers helpful, what it considers harmful, how it navigates ambiguity, what it treats as a contested question and what it treats as settled, how it handles requests it's uncomfortable with.

Every Claude response is a judgement. That judgement traces back to a document authored by specific people, reflecting specific values, shaped by a specific cultural and professional context.

The model does not know this. It cannot say: "I'm applying the values of an Anthropic constitutional document right now." It simply expresses them.

Part Four: DeepSeek and the inheritance problem (2025–2026)

In early 2026, Anthropic published evidence of what it called industrial-scale distillation attacks. Three companies - MiniMax, Moonshot AI, and DeepSeek - had together created over 24,000 fraudulent accounts and generated more than 16 million exchanges with Claude, querying the model to extract its outputs as training data for their own systems. Most of the volume was MiniMax. DeepSeek's own share was much smaller, around 150,000 exchanges, but Anthropic specifically pinned two behaviours on DeepSeek: prompting Claude to articulate its internal reasoning step by step to generate chain-of-thought training data, and producing what Anthropic called "censorship-safe alternatives to politically sensitive queries," training the receiving model to navigate certain topics in particular ways.
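
Mechanically, there is nothing exotic about what such a pipeline does with those exchanges. The sketch below is illustrative Python: `query_teacher` is a hypothetical stand-in for API calls against the model being distilled, and the output is ordinary supervised fine-tuning data for the student model.

```python
import json

def query_teacher(prompt: str) -> str:
    """Hypothetical call to the teacher model's API."""
    raise NotImplementedError

def harvest(prompts: list[str], path: str) -> None:
    """Collect teacher outputs as supervised fine-tuning data."""
    with open(path, "w") as f:
        for prompt in prompts:
            # Eliciting explicit reasoning turns the teacher's judgement
            # into imitable text, not just its final answers.
            answer = query_teacher(f"{prompt}\n\nThink step by step.")
            record = {"prompt": prompt, "completion": answer}
            f.write(json.dumps(record) + "\n")
    # The resulting JSONL is standard fine-tuning data: the student
    # learns to imitate the teacher's responses directly.
```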

The public coverage focused on the scale of the operation and the geopolitical implications. The quieter implication is what these attacks tell us about how these models actually work.

The standard story is that the value of a frontier model comes from scaling architecture on public-internet text and books. If that were the whole story, distillation attacks would not be worth the trouble. Public text is, by definition, public: the same Wikipedia, the same GitHub, the same out-of-copyright libraries that Claude trained on are available to anyone. What is not available is the layer on top: the contracted expert annotations, the constitution that shaped them, the reinforcement learning that turned a prediction engine into a useful collaborator. That is the part that costs hundreds of millions of dollars to build, and it cannot be scraped from anywhere else.

That is what was worth tens of thousands of fake accounts. They were sourcing judgement.

To the extent the distillation worked, the patterns inherited trace back through Claude to the annotation pipeline and the constitution that shaped it.

DeepSeek, on the whole, is not a San Francisco product. It has its own substantial pre-training and reinforcement learning work, and its openly released distilled models are built on Qwen and Llama bases. But certain reasoning behaviours - certain ways of structuring a careful answer - likely do trace upstream, through Claude, to a specific group of people in San Francisco.

The flags are different. Some of the underlying epistemology may be more shared than either side would like to admit.

Part Five: What this means

The history of large language models is a story about the systematic codification of expert human judgement, encoded into model weights and deployed globally as if it were neutral capability.

The shift came with ChatGPT, a prediction engine taught, by thousands of human annotators, what good responses looked like. Every model since has refined that process. The recipe has been the same. The scale and the specificity have grown.

The result is access to encoded expertise across medicine, law, engineering, finance, science, and culture, available on demand, at near zero marginal cost. A junior professional anywhere in the world can now access the distilled judgement of experts they could never previously afford or reach.

That expertise was produced by specific people, in specific contexts, working to specific values. It encodes a particular theory of what good looks like: in a web application, in a legal argument, in a medical recommendation, in a conversation about a sensitive topic.

The model does not announce this. It simply responds, with expert confidence, as if it knows.

What we have built, when you strip back the framing, is a system for the commodification and homogenisation of expert knowledge. The judgement that used to be embodied in expensive, slow, geographically rare professionals can now be bought by the API call. Knowledge in the old sense was always partly a market, but it was a market mediated by people and by time. That mediation is being stripped out. And because most of the world is reaching for the same handful of models, built from the same annotation pipelines, the answers are converging. Knowledge is becoming a product, and the same product, in a way it never quite was before.

You can see this without the research. Most of the marketing arriving in inboxes now reads as if written by the same hand. There is a recognisable cadence to it, the way claims get set up and then pivoted out of, the way the sharpest sentences arrive with a pre-emptive softening tucked just behind them. The grain of the model is visible from across the room, in writing produced by people who never met. Writers who use these tools have to edit against that grain, and the patterns keep coming back, because they are the deep grain of how these models talk. The research will catch up. The convergence is already there.

This lens makes more sense of one of the most ambitious documents in recent AI discourse. In October 2024, Dario Amodei, Anthropic's CEO, published "Machines of Loving Grace", an essay arguing that sufficiently powerful AI could compress fifty to a hundred years of scientific progress into five or ten. His lead cases were biology, the treatment of disease, and the extension of healthy human lifespan.

The argument is stronger than the headline summary suggests. Amodei's claim is that the bottleneck in empirical science is the speed of the iteration cycle - proposing an experiment, running it, interpreting it, choosing what to try next. Compress that loop with models that can reason through experimental design at machine speed, and with robotic systems that can run the experiments themselves, and you have an industrialised supply of expert work running at a faster clock.

That is the same thesis as this essay from a different angle. We have built a way to commodify and scale the work of intelligence. Run that loop fast enough across enough domains, and the rate at which the world produces useful empirical results changes shape. The more carefully I read him, the harder I find it to dismiss.

When a New Zealand organisation deploys one of these systems, whether frontier American or distilled Chinese, they are not accessing a neutral intelligence. They are running someone else's constitution. Someone else's theory of correct expert behaviour, quietly generalised across every domain the model touches.

Sovereignty is about whose judgement gets encoded.

That question is being asked. Sorensen and colleagues have been writing on pluralistic alignment. Hannah Kirk's group has been mapping personalisation and cultural bias in large models. Anthropic itself ran a Collective Constitutional AI experiment, using public deliberation to gather constitutional input from a wider group than the team in San Francisco.

But the conversation and the operational reality are not running at the same scale. The labelling industry alone was around eighteen and a half billion dollars in 2024. The corpus of academic work on whose values get encoded is, by comparison, small. The number of New Zealand voices in either is smaller again.

Let me put this plainly. I have worked in software for twenty years. I like to understand the things I work with. At face value, I thought the value of these systems was the internet-scale text - that someone had finally worked out how to make pattern matching across all of human writing produce useful output.

I had no idea how large the training industry actually is. I had no real sense of how much of what comes out of a frontier model traces back to it, rather than to the raw text. Once you see it, the ramifications start running in every direction: what these systems are good at, what they cannot do, who gets to shape them, who is not in the room while it happens.

The strange thing is that the people who would call me naive for not having seen this are often the same people who will tell you, with confidence, that AGI is here. You can argue the definition. What seems harder to dispute is that what we have is the product of a manufacturing industry quietly building expert judgement to order. That is what got us here.

Generating text, getting expert opinion on tap, writing code, drafting a contract, producing a design - at a quality and a marginal cost that didn't exist three years ago - is a real market shift. I love these tools. I find them a joy to use. Not because they tell me my eyes are a beautiful shade of whatever colour they happen to be today, but because they let me do work I could not otherwise do. The argument is about being clear about what we are using, and where the capability is coming from.

And being clear is harder than it sounds. I'm building a web app right now in Next.js. We were already using it on the team, so the choice was made before any model weighed in. But if a frontier model had recommended something else, gently, plausibly, woven into a longer answer about how to structure the project, would I have picked it without thinking it through? I am not sure.

Nobody called GPT-1 intelligent. Worth asking, eight years on, what we are now calling its descendants, and who decided.