Training vs. Testing: Why Your Model Needs to Prove Itself on New Data

5 min read

Does this sound familiar? You’re tutoring a student for an upcoming math test. You help them solve dozens of practice problems over several days, and by the end, they’re getting every problem right. You feel confident they’ve mastered the material.

Test day arrives, and they completely bomb it. New problems that tested the same concepts in slightly different ways? Total disaster. They memorized your practice problems but never actually learned the underlying math.

This exact mistake costs companies millions every year and has led to AI systems that discriminate, medical devices that fail, and self-driving cars that crash. It’s also the #1 reason why beginners think they’ve built something amazing when they’ve actually built nothing useful.

This scenario plays out constantly in machine learning, and it’s why you can’t just judge models on data they’ve already seen. I’ve watched teams celebrate 97% accuracy only to deploy models that barely work. The difference? They never properly tested on genuinely new data.

We’re continuing our Mental Models for ML series here. So far we’ve covered supervised vs. unsupervised learning and classification vs. regression. Now we’re hitting something that seems obvious but somehow trips up even experienced teams: the absolute necessity of testing your models on data they’ve never encountered.

The memorization trap

Machine learning models are basically very sophisticated pattern memorizers. When I say they “learn,” I’m being generous – they’re really just adjusting thousands of internal parameters until they get good at predicting the examples you’ve shown them. No consciousness, no understanding, just mathematical optimization finding the best settings for your specific dataset.

The problem? Give a complex enough model enough time, and it’ll memorize your training data perfectly while learning absolutely nothing useful about the real world.

Think of it like learning to drive. Practicing in an empty parking lot (training) might make you feel confident, but the real test is navigating rush hour traffic you’ve never seen before (testing).

I saw this happen with a public health department that built what they thought was an incredible system for predicting disease outbreaks. 97% accuracy! They were ready to revolutionize epidemic response. But when I dug into their process, I found they’d accidentally included the reporting hospital’s ID number as an input feature. The model wasn’t learning anything about disease patterns – it was literally just memorizing which specific hospitals had reported outbreaks in their historical data.

For new outbreaks in different regions? Completely useless. After fixing this and testing properly, their accuracy dropped to 76%. Still decent, but a far cry from the 97% they thought they had.

How to split your data (without screwing it up)

The basic approach is straightforward:

  • 70-80% for training
  • 20-30% for testing

In practice: Import your data, randomly shuffle it, take the first 80% for training, save the last 20% for testing. Train your model only on the 80%. Then – and only then – see how it performs on that untouched 20%.
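
If you're working in Python with scikit-learn (one common setup, though by no means the only one), that workflow looks roughly like this. The file name and column names below are placeholders for your own data:

```python
# A minimal 80/20 split sketch using pandas and scikit-learn.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("outbreaks.csv")          # hypothetical dataset
X = df.drop(columns=["outbreak"])          # input features
y = df["outbreak"]                         # the label we want to predict

# Shuffle and split: 80% for training, 20% held back for the final test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)                # the model only ever sees the 80%

# Only now do we look at the untouched 20%.
print("Test accuracy:", model.score(X_test, y_test))
```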

But the devil’s in the details. Random splitting works most of the time, but not always. If you’re working with time-based data like stock prices or sensor readings, you need to train on earlier data and test on later data. Otherwise you’re basically letting your model see the future, which works great until you try to actually predict the future.
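
In code, that just means sorting by time and splitting at a cutoff date instead of shuffling. A minimal sketch, again with a hypothetical file and column names:

```python
# Chronological split sketch: train on the past, test on the "future".
import pandas as pd

df = pd.read_csv("sensor_readings.csv", parse_dates=["timestamp"])
df = df.sort_values("timestamp")

cutoff = pd.Timestamp("2023-01-01")        # everything before this is training
train = df[df["timestamp"] < cutoff]
test = df[df["timestamp"] >= cutoff]       # the model never gets to peek at this era
```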

Here’s the golden rule: your testing data stays completely locked away until the very end. Don’t peek. Don’t use it to choose features. Don’t use it to tune parameters. I’ve seen teams gradually corrupt their test sets by repeatedly checking performance and adjusting their approach. It’s like studying the actual exam questions – sure, you’ll do great on the test, but you haven’t actually learned anything.

The three-way split that actually works

But wait – if you can’t look at test data during development, how do you know if your changes are helping?

Enter validation data:

  • 60-70% training
  • 15-20% validation
  • 15-20% testing

Think of validation as practice tests. You study (train), take a practice test (validate), adjust your approach, repeat. The real exam (test) stays hidden until you’re done.
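
Mechanically, one simple way to get there is to call train_test_split twice – a sketch on toy data, assuming you're happy with roughly 60/20/20:

```python
# Three-way split sketch: carve out the test set first, then split the rest.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)   # toy stand-in data

# First cut: 20% locked away as the final test set.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Second cut: 25% of the remaining 80% becomes validation (20% overall).
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42
)

# Iterate on X_train / X_val as much as you like.
# Touch X_test exactly once, at the very end.
```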

This approach proved crucial on a water quality monitoring project where the stakes were high. The team iterated for weeks, getting 92% accuracy on their validation set for predicting contamination events. When they finally ran the untouched test set, they got 91%. Close enough that everyone trusted it would work in production – and it did.

When you don’t have enough data

Sometimes you’re working with limited data and can’t afford to set aside big chunks for validation and testing. Cross-validation helps here.

You split your training data into 5 or 10 folds, train the model that many times with a different fold held out for validation each time, then average the results. It’s like taking multiple practice tests with different question sets.
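
With scikit-learn, a five-fold version of that idea might look like this – a sketch on synthetic data, not any particular project’s code:

```python
# 5-fold cross-validation sketch: every example gets one turn in the
# validation fold, and we average the five scores.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, random_state=0)   # small toy dataset
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=5)   # five train/validate rounds
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```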

This technique particularly shines in archaeological research where data is precious. A colleague’s project identifying pottery styles across ancient civilizations squeezed about 4% more performance out of the same scarce data by tuning with cross-validation instead of a single small validation split – meaningful when you’re trying to understand cultural connections across millennia.

Real examples where this matters

House price prediction is a perfect beginner example. You’re predicting prices using square footage, bedrooms, and neighborhood. Train on houses sold in 2020-2022, test on houses from 2023. If your model only works on the old houses, it’s useless for today’s market – housing markets shift constantly.

Music preservation efforts have gotten interesting results with this approach. Researchers trained models to identify and catalog endangered folk songs from field recordings using thousands of ethnomusicologist-labeled samples. The system learned to recognize specific musical patterns, vocal techniques, and instrumental signatures. After training, it could process new archive recordings and help preserve cultural heritage. The key? Testing on recordings from completely different regions and ethnic groups than the training data.

I’ve seen agricultural optimization projects tackle this well – crop yield prediction through soil analysis and weather patterns. Teams train on farms across the Midwest, then test on farms in completely different climates and soil types. No memorizing specific fields, just recognizing agricultural patterns that transfer across regions.

The mistakes I’ve made (so you don’t have to)

Data leakage is sneaky. A linguistics research team I know normalized all their dialect data before splitting it. Sounds innocent, right? Wrong. The normalization process used statistics from the entire dataset, meaning information about the test set influenced the training data preprocessing. Performance looked amazing until they fixed it.
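
The fix is to compute normalization statistics from the training data only. One way to make that hard to get wrong – a sketch using scikit-learn’s Pipeline, not necessarily what that team did – is to bundle the preprocessing and the model so the scaler is refit inside each training fold:

```python
# Leakage-safe preprocessing sketch: the scaler learns its mean and standard
# deviation from the training folds only, never from held-out data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, random_state=0)   # toy stand-in data

# Leaky version (don't do this): StandardScaler().fit_transform(X) before
# splitting lets test-set statistics influence the training data.

# Safer: the pipeline fits the scaler on each training fold automatically.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipeline, X, y, cv=5)
print("Leak-free mean accuracy:", scores.mean())
```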

Time travel is another classic. There’s a climate modeling system that performed beautifully when researchers randomly selected test years from historical weather data. In production? Complete failure. Turns out random splitting had let the model use future climate information to predict past patterns. Once they properly tested forward in time only, reality hit hard.

Validation set overfitting happens when you iterate too many times on the same validation data. You start optimizing for that specific set’s quirks rather than general patterns. Sometimes you need to refresh your validation data if you have enough examples.

What makes a model “good enough”?

There’s no magic accuracy number that makes a model deployable. Context is everything.

Archaeological dating might need 95%+ accuracy because miscategorizing artifacts could rewrite history. Urban planning prediction might be valuable at 70% because the stakes are lower. Sometimes a 75% accurate model that explains its reasoning beats a 90% accurate black box.

The real question isn’t “Is my model good?” It’s “Is my model good enough for this specific problem, with these specific consequences for being wrong?”

When your model disappoints (and what to do about it)

Finding out your model doesn’t work as hoped is normal. I’ve been there more times than I care to admit.

New learners often panic when their test accuracy comes in lower than their training accuracy, but some gap is expected – 5-10 percentage points is usually fine. It’s when training hits 95% and testing lands at 60% that you know you’re in trouble.

Start with your data. Performance problems often trace back to data quality issues, not algorithm choice. Then look at where your model makes mistakes – are there patterns? Specific types of examples it struggles with?

If it performs poorly on both training and validation data, it’s probably too simple (underfitting). If it crushes training data but fails validation, it’s memorizing rather than learning (overfitting).
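
If you want to see that gap concretely, here’s a minimal sketch on synthetic data, using a deliberately unconstrained decision tree just to make the overfitting pattern visible:

```python
# Diagnosis sketch: compare performance on data the model has seen
# (training) versus data it hasn't (validation).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_informative=5, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=1)

# A fully grown tree on a small dataset is a classic overfitter.
model = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)

print(f"Training accuracy:   {model.score(X_train, y_train):.2f}")   # often ~1.00
print(f"Validation accuracy: {model.score(X_val, y_val):.2f}")       # noticeably lower

# Low on both           -> likely underfitting (model too simple)
# High train, low val   -> likely overfitting (memorizing, not learning)
```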

Machine learning is iterative. Most successful projects go through multiple rounds of “this doesn’t work, let me try something else.”

The bigger picture

Training vs testing might seem like a technical detail, but it’s really about the fundamental challenge of building systems that work in the real world, not just on the data you happen to have.

Our mental models so far:

  • Supervised vs unsupervised: Do you have labeled examples?
  • Classification vs regression: Predicting categories or numbers?
  • Training vs testing: Does it actually generalize?

These frameworks help you think systematically about any ML project, regardless of which specific algorithms you end up using.

Next up we’ll tackle bias vs variance – how models can fail in two completely different ways and why finding the right balance matters more than just optimizing for accuracy.

Ever had a model that looked amazing in development but flopped in production? What went wrong? I’m always curious about where teams hit these walls because the patterns tend to repeat across different domains.
