From Mendel’s Peas to ChatGPT: A History of Machine Learning

The roots of machine learning go back further than most people realize. Not to a computer lab or a Silicon Valley garage, but to a monastery garden in the 1850s. A monk named Gregor Mendel spent seven years cross-pollinating pea plants, and in doing so, he established principles that would become foundational to a field that wouldn’t exist for another century.

The Monk and His Peas

From 1856 to 1863, Mendel conducted one of the most famous experiments in scientific history. He cross-pollinated pea plants and tracked seven specific traits across generations: plant height, pod shape, pod color, seed shape, seed color, flower position, and flower color. Over 28,000 plants.

He noticed something. When he crossed tall plants with short plants, all the offspring were tall. But when those offspring self-pollinated, the next generation showed a consistent ratio: roughly 3 tall to 1 short. This ratio appeared across every trait he studied.

Mendel recognized this wasn’t coincidence. It was a pattern pointing to underlying rules. He concluded that traits were determined by discrete units of inheritance (what we now call genes) that came in pairs, one from each parent. Some were dominant, some recessive.

Why does this matter for machine learning? Because Mendel was doing, by hand, what ML algorithms do at scale: collecting large amounts of data, identifying patterns, using those patterns to make predictions, and selecting the most relevant features for analysis. His approach was data-driven, mathematical, and reproducible. He just didn’t have a computer.
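As a rough modern analogy, here is a short Python sketch (a simulation I made up, not Mendel's actual data) that crosses two Aa plants thousands of times and checks how close the tall-to-short ratio lands to 3:1:

```python
import random

def cross(parent1="Aa", parent2="Aa", n=10_000):
    """Simulate n offspring from two heterozygous (Aa) pea plants.
    'A' (tall) is dominant, 'a' (short) is recessive."""
    tall = short = 0
    for _ in range(n):
        # Each parent passes on one allele at random.
        offspring = random.choice(parent1) + random.choice(parent2)
        if "A" in offspring:   # at least one dominant allele -> tall
            tall += 1
        else:                  # aa -> short
            short += 1
    return tall, short

tall, short = cross()
print(f"tall : short ratio is about {tall / short:.2f} : 1")  # hovers around 3 : 1
```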

Turing’s Vision

Fast-forward to the 1950s. British mathematician Alan Turing began laying the theoretical groundwork for artificial intelligence.

In his 1950 paper “Computing Machinery and Intelligence,” Turing proposed what became known as the Turing Test: if a human evaluator couldn’t distinguish between a machine’s responses and a human’s, the machine could be considered intelligent. That debate is still going.

More importantly, Turing conceptualized a universal computing machine capable of simulating any other machine’s computation. This became the theoretical foundation for modern computers and, eventually, the hardware that runs ML algorithms today. He predicted that machines would “eventually compete with men in all purely intellectual fields.” Bold for 1950. Increasingly accurate by 2025.

The Birth of AI: Dartmouth, 1956

Turing’s theoretical work needed a community. In the summer of 1956, a group of scientists gathered at Dartmouth College for what became the formal birth of artificial intelligence as a field. John McCarthy had coined the term “artificial intelligence” in the proposal for the workshop, which ambitiously claimed that “every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it.”

That turned out to be harder than they expected. But the workshop established AI as a distinct field, brought together researchers from mathematics, psychology, and engineering, and set a long-term vision that’s guided research for decades.

ELIZA: The First Chatbot

Between 1964 and 1966, MIT’s Joseph Weizenbaum created ELIZA, one of the world’s first chatbots. It simulated a therapist using surprisingly simple pattern matching. It identified keywords in user input and generated responses by rephrasing what the user said as a question.

User: “I am feeling sad.”
ELIZA: “Why do you feel sad?”
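ELIZA’s real script was more elaborate, but the core trick fits in a few lines of Python. The rules below are invented for illustration; they are not Weizenbaum’s originals:

```python
import re

# Toy keyword rules in the spirit of ELIZA; the patterns are invented for illustration.
RULES = [
    (re.compile(r"\bI am (?:feeling )?(.+)", re.IGNORECASE), "Why do you feel {0}?"),
    (re.compile(r"\bI need (.+)", re.IGNORECASE), "What would it mean to you to get {0}?"),
    (re.compile(r"\bbecause (.+)", re.IGNORECASE), "Is that the real reason?"),
]

def respond(user_input: str) -> str:
    for pattern, template in RULES:
        match = pattern.search(user_input)
        if match:
            topic = match.group(1).rstrip(".!?")
            return template.format(topic)   # rephrase the user's words as a question
    return "Tell me more."                  # fallback when no keyword matches

print(respond("I am feeling sad."))         # -> Why do you feel sad?
```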

The technique was basic. What surprised Weizenbaum was how quickly users attributed human-like qualities to the program, even when they knew it was a computer. People formed emotional attachments. Weizenbaum himself found this disturbing.

ELIZA demonstrated something important: apparent intelligence can be simulated without true understanding. That insight is still relevant every time someone anthropomorphizes ChatGPT.

The Monkey and Wall Street

In 1973, economist Burton Malkiel proposed a thought experiment in “A Random Walk Down Wall Street”: a blindfolded monkey throwing darts at the stock listings could pick a portfolio that performed as well as one carefully chosen by experts. The Wall Street Journal later ran actual dartboard contests. Random selections often matched or beat the professionals.

This matters for ML because it established a principle that’s still essential: before you trust a complex model, compare it to a simple baseline. If your sophisticated algorithm can’t beat a monkey throwing darts (or a basic average, or a random guess), it’s not actually learning anything useful. It also highlighted the danger of overfitting: seeing patterns where none exist.
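In practice, the dartboard test just means scoring your model against something deliberately dumb. A minimal sketch with made-up data (the “stocks” and labels here are random stand-ins, not real market data):

```python
import random

random.seed(0)

# Invented ground truth: did each of 200 hypothetical stocks beat the market this year?
truth = [random.random() < 0.5 for _ in range(200)]

def accuracy(preds, actual):
    return sum(p == a for p, a in zip(preds, actual)) / len(actual)

# Baseline 1: the dart-throwing monkey (pure chance).
monkey = [random.random() < 0.5 for _ in truth]

# Baseline 2: always predict whichever outcome is most common.
majority = [max(set(truth), key=truth.count)] * len(truth)

# A model worth trusting should clearly beat both of these on held-out data.
print("monkey accuracy:  ", accuracy(monkey, truth))
print("majority accuracy:", accuracy(majority, truth))
```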

The Perceptron: The First Learning Machine

In 1957, Frank Rosenblatt created the perceptron, the first artificial neural network that could learn from data. Instead of being explicitly programmed with rules, it adjusted its behavior based on experience. This was a paradigm shift.

The perceptron introduced the concept of a learning algorithm that could adjust weights over time. Iterative improvement through training. This idea became central to virtually all modern ML.

But it had a serious limitation: it could only solve problems where a straight line could separate the categories. Complex, nonlinear problems were beyond it. Think of it as a very smart switch. Powerful, but still just a switch.
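A perceptron is small enough to write out in full. The minimal sketch below (my own toy version, not Rosenblatt’s original hardware or code) learns the linearly separable AND function but can never get XOR right, no matter how long it trains:

```python
def train_perceptron(samples, epochs=20, lr=0.1):
    """Classic perceptron learning rule on 2-input binary data."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            error = target - pred
            # Nudge the weights toward the correct answer.
            w[0] += lr * error * x1
            w[1] += lr * error * x2
            b += lr * error
    return w, b

def predict(w, b, x1, x2):
    return 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]  # linearly separable
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]  # not separable by a line

for name, data in [("AND", AND), ("XOR", XOR)]:
    w, b = train_perceptron(data)
    correct = sum(predict(w, b, *x) == t for x, t in data)
    print(f"{name}: {correct}/4 correct")   # AND: 4/4, XOR: at most 3/4
```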

In 1969, Marvin Minsky and Seymour Papert published Perceptrons, a book highlighting these limitations. Funding dried up. Interest faded. The AI Winter began.

The AI Winter

The 1970s and 1980s were rough for AI research. Funding disappeared, promises went unfulfilled, and the field stalled publicly. But important work continued quietly.

Vladimir Vapnik and Alexey Chervonenkis developed statistical learning theory, providing mathematical frameworks for understanding when and why ML algorithms generalize to new data. Judea Pearl developed probabilistic reasoning and Bayesian networks. Ross Quinlan created decision tree algorithms that are still widely used today.

The Winter also pushed researchers toward practical applications in specific domains, leading to advances in speech recognition and computer vision that would prove essential later.

Backpropagation: The Thaw

The AI Winter began to thaw in the 1980s with the popularization of the backpropagation algorithm. The math had existed since the 1960s, but a landmark 1986 paper by Rumelhart, Hinton, and Williams demonstrated its practical potential for training multi-layer neural networks.

Backpropagation (backward propagation of errors) provided an efficient way to figure out how much each weight in a network contributed to the overall error. Using the chain rule of calculus, it could train networks with multiple layers, overcoming the single-layer limitation that had killed the perceptron.

Multi-layer networks trained with backpropagation could approximate virtually any continuous function. They could automatically extract relevant features from raw data instead of requiring hand-engineered ones. In theory, this was a breakthrough. In practice, the computational power and large datasets needed to make it work weren’t available yet. Neural networks remained a niche approach through the 1990s and early 2000s.
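Still, the mechanics are simple enough to show in a toy example. Here is a minimal numpy sketch (my own illustration, not code from the 1986 paper) that backpropagates errors through one hidden layer to learn XOR, the problem a single perceptron could not solve:

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR: not linearly separable, so a single-layer perceptron fails on it.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One hidden layer of 4 units, one output unit.
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)

lr = 1.0
for step in range(10_000):
    # Forward pass.
    h = sigmoid(X @ W1 + b1)            # hidden activations
    out = sigmoid(h @ W2 + b2)          # network output

    # Backward pass: the chain rule, applied layer by layer.
    d_out = (out - y) * out * (1 - out)       # error signal at the output
    d_h = (d_out @ W2.T) * h * (1 - h)        # error signal at the hidden layer

    # Gradient descent updates.
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0)

pred = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)
print(pred.round(2).ravel())   # typically close to [0, 1, 1, 0]
```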

Support Vector Machines

While neural networks waited for hardware to catch up, Vladimir Vapnik and colleagues introduced Support Vector Machines (SVMs) in the 1990s. SVMs found the optimal boundary separating different classes in high-dimensional space, and they had a strong theoretical foundation explaining why they worked.

Their key innovation was the “kernel trick,” which let them efficiently handle high-dimensional, nonlinear problems. SVMs offered excellent performance with limited training data, handled high-dimensional data well, and were less prone to overfitting. For nearly two decades, they were state-of-the-art for many applications, from text classification to bioinformatics.
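If scikit-learn is available, the kernel trick is literally a one-word choice. A minimal sketch on toy data (concentric rings that no straight line can separate; parameters left at their defaults):

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: no straight line can separate them.
X, y = make_circles(n_samples=500, noise=0.1, factor=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear = SVC(kernel="linear").fit(X_train, y_train)
rbf = SVC(kernel="rbf").fit(X_train, y_train)       # the kernel trick at work

print("linear kernel accuracy:", linear.score(X_test, y_test))  # roughly chance
print("RBF kernel accuracy:   ", rbf.score(X_test, y_test))     # near perfect
```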

SVMs demonstrated that ML could thrive with approaches beyond neural networks. But the computational barriers that kept neural networks on the sidelines were about to fall.

Big Data Meets GPUs

As we entered the 2000s, several things converged. The internet, social media, and connected devices generated an explosion of data. Computational power increased dramatically, and GPUs (originally built for video games) turned out to be perfect for the parallel math that neural networks require.

This convergence was transformational. Abundant data provided the training material that complex models needed. GPUs provided the processing power to actually train them. Companies like Google, Amazon, and Facebook began applying ML at scale. The stage was set.

The Deep Learning Revolution

In 2006, Geoffrey Hinton and colleagues showed how to effectively train deep neural networks with many layers. By the early 2010s, what started as improved training techniques had become a genuine revolution.

The pivotal moment came in 2012, when a deep learning model called AlexNet dramatically outperformed traditional approaches in image recognition. It wasn’t close. Speech recognition error rates dropped significantly. Natural language processing advanced with models that could capture semantic relationships between words.

The fundamental insight: deeper networks with more layers could learn increasingly abstract representations. Lower layers detect edges in images, middle layers recognize shapes, higher layers identify objects. This hierarchical feature learning proved powerful across applications that had seemed intractable just years before.
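A PyTorch sketch of that layer-stacking idea (the layer sizes are arbitrary and the comments are illustrative guesses at what each stage tends to learn; this is not a reproduction of AlexNet):

```python
import torch
from torch import nn

# A small convolutional stack: each block works on the output of the one below,
# so later layers see increasingly abstract summaries of the raw pixels.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # edges, textures
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # simple shapes
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # object parts
    nn.Flatten(),
    nn.Linear(64 * 8 * 8, 10),   # class scores for a 64x64 input image
)

dummy = torch.randn(1, 3, 64, 64)   # one fake RGB image
print(model(dummy).shape)           # torch.Size([1, 10])
```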

Transformers: Attention Changes Everything

In 2017, a team at Google published “Attention Is All You Need,” introducing the Transformer architecture. If you’ve used ChatGPT, Claude, or any modern AI assistant, you’re using technology built on this.

Previous approaches processed information sequentially, one word at a time, often losing important details from earlier in a long text. Transformers process entire sequences simultaneously, connecting related ideas regardless of how far apart they appear. This was a fundamental improvement.
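The core operation is scaled dot-product attention, which fits in a few lines of numpy (a bare-bones sketch of the mechanism only; real Transformers add multi-head projections, masking, positional encodings, and much more):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # how strongly each token attends to every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the whole sequence at once
    return weights @ V                              # blend of values, regardless of distance

# Five "tokens" with 8-dimensional embeddings (random stand-ins for real word vectors).
rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))
print(attention(tokens, tokens, tokens).shape)      # (5, 8): one updated vector per token
```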

What made Transformers truly significant was their scalability. As researchers added more data and computing power, capabilities improved dramatically and sometimes unpredictably. Larger models showed qualitatively new abilities that surprised even the people who built them.

Originally designed for language, Transformers proved versatile enough for images, code generation, scientific research, and creative tasks. But this power comes with real costs. Training a single large language model can consume the energy equivalent of hundreds of homes for a year. And understanding how these models make decisions remains an open problem.

What the History Teaches

A few patterns repeat across this entire timeline.

Cycles of hype and disappointment. Minsky and McCarthy predicted human-level AI in the 1960s. It didn’t happen. The resulting disillusionment caused the AI Winter. We should remember this when evaluating today’s claims.

Breakthroughs build on decades of quiet work. Deep learning’s 2012 moment depended on backpropagation from 1986, statistical learning theory from the 1970s, and neural network concepts from the 1950s. The overnight successes were decades in the making.

Ethics lag behind capabilities. Algorithmic bias, privacy violations, environmental costs, and concentration of AI power in well-funded organizations are all problems that emerged faster than the safeguards to address them.

The field benefits from diversity of approaches. SVMs dominated when neural networks couldn’t scale. Decision trees remain useful alongside deep learning. Betting everything on one paradigm has historically been a mistake.

Where We Are Now

The journey from Mendel’s pea plants to today’s AI systems spans nearly 170 years of human curiosity and persistence. Each breakthrough built on previous work, often in unexpected ways.

Machine learning continues to evolve rapidly. The challenges (energy consumption, explainability, bias, accessibility) are significant. So are the possibilities. Understanding this history helps you see not just how far we’ve come, but why skepticism and critical thinking are as important as technical innovation.

The next chapters are still being written.
