What Is Cross Validation in Machine Learning

Cross validation in machine learning is not just a technique for testing a model. It is a way of asking a harder question than a single test can answer.

A model tested once on one slice of data might look accurate because that slice happened to suit it. Cross validation tests the model repeatedly on different slices and asks whether performance holds across them. A model that performs consistently is more likely to hold up on data it has never seen. A model that does not is telling you something important before it ever reaches production.

Why a Single Test Is Not Enough

When you split a dataset into training and test sets once, evaluation depends on that particular split. The training set is what the model learns from. The test set is what you use to measure how well it learned. The result is a single number representing model performance.

The problem is that a single number drawn from a single split is sensitive to which examples ended up where. If the test set happened to contain examples that were easier to predict, performance looks better than it should. If it contained examples that were harder, performance looks worse. Either way, you are measuring the model and the split together, not the model alone.
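
The effect is easy to reproduce on a small scale. The sketch below assumes scikit-learn is available and uses a synthetic dataset and a logistic regression model as stand-ins (neither comes from a real project). It keeps the data and the model fixed and changes only the random split.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a real dataset (an assumption for illustration).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

scores = []
for seed in range(10):
    # Same data, same model type; only the random 80/20 split changes.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))

print(f"lowest accuracy across splits:  {min(scores):.3f}")
print(f"highest accuracy across splits: {max(scores):.3f}")
```

Any spread between the lowest and highest score comes from the split, not from a change in the model.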

This is not a hypothetical concern. A company building a model to screen job applicants split its dataset once, trained the model, and evaluated it on the held-out test set. Performance looked strong. When the model was applied to new applicants over the following quarter, it performed noticeably worse. A later review found that the original test set had, by chance, underrepresented applicants from certain roles the model struggled to evaluate. The single split had produced a misleadingly optimistic result.

Cross validation reduces this risk by repeating the evaluation process across multiple splits and averaging the results. No single split determines the outcome.

How Cross Validation Works

The intuition behind cross validation is straightforward. If a model has genuinely learned something, it is likely to perform reasonably well regardless of which examples it was trained on and which it was tested on. Cross validation checks this by changing the split repeatedly and seeing whether performance holds.

The most common approach is called k-fold cross validation. The dataset is divided into k parts of roughly equal size, called folds. The model is trained on k-1 of those folds and tested on the remaining one. That process repeats k times, each time holding out a different fold as the test set. At the end, you have k performance measurements rather than one. Those measurements are averaged to produce a more stable estimate of how the model is likely to perform on new data.

If k is set to five, the model is trained and tested five times, each time on a different 80/20 split of the data. The first fold is held out while the model trains on the other four. Then the second fold is held out while the model trains on the remaining four. This continues until every fold has served as the test set exactly once. Within any one round, no example appears in both the training set and the test set.
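
The fold mechanics can be sketched in plain Python without any library. The helper name `k_fold_indices` is hypothetical, and the folds here are contiguous blocks to keep the logic visible. In practice the data is usually shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


folds = list(k_fold_indices(n_samples=100, k=5))

for number, (train, test) in enumerate(folds, start=1):
    # Within a round, the training and test indices never overlap.
    assert not set(train) & set(test)
    print(f"round {number}: train on {len(train)} examples, test on {len(test)}")

# Across all rounds, every example is held out exactly once.
held_out = sorted(i for _, test in folds for i in test)
assert held_out == list(range(100))
```

With 100 examples and k set to five, each round trains on 80 examples and tests on 20.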

If performance is consistent across all five folds, that consistency is meaningful. If performance varies significantly from fold to fold, that variation is also meaningful. It suggests the model is sensitive to which examples it sees during training, which is a signal that the model may be overfitting or that the dataset contains subgroups the model handles differently.
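
One way to look at both the average and the fold-to-fold variation is scikit-learn's `cross_val_score`, which returns one score per fold. This is a minimal sketch on synthetic data with a logistic regression model, both chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption, not a real dataset).
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)

model = LogisticRegression(max_iter=1000)

# cv=5 runs five rounds of train-and-test and returns five accuracy scores.
scores = cross_val_score(model, X, y, cv=5)

print("per-fold accuracy:", scores.round(3))
print(f"mean: {scores.mean():.3f}  standard deviation: {scores.std():.3f}")
```

The mean is the headline estimate. The standard deviation is a rough indication of how much that estimate depends on which examples landed in which fold.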

The choice of k involves a tradeoff. Larger values of k mean each training set is larger and closer in size to the full dataset, so the averaged estimate is less pessimistic about the final model. The cost is that each test fold is smaller, so individual fold scores are noisier, and the model must be trained more times. Smaller values of k are faster, but each model trains on less of the data and the estimate is more sensitive to how the data happened to be divided. Five and ten are the most common choices, and either works well for most problems.
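
The arithmetic behind that tradeoff can be worked through directly. The helper `fold_summary` and the dataset size of 1,000 are hypothetical, and the sketch assumes the dataset divides evenly by k.

```python
def fold_summary(n_samples, k):
    """Return fold sizes and the number of training runs for k-fold CV.

    Assumes n_samples divides evenly by k to keep the arithmetic simple.
    """
    test_size = n_samples // k
    return {
        "k": k,
        "training_runs": k,
        "train_size": n_samples - test_size,
        "test_size": test_size,
    }


n_samples = 1000  # hypothetical dataset size
for k in (2, 5, 10, 20):
    s = fold_summary(n_samples, k)
    print(
        f"k={s['k']:>2}  runs={s['training_runs']:>2}  "
        f"train={s['train_size']}  test={s['test_size']}"
    )
```

Moving from k of five to k of ten doubles the number of training runs while halving the size of each test fold, from 200 examples to 100.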

Cross Validation, Overfitting, and Regularization

Cross validation does not prevent overfitting. It helps reveal it.

When a model overfits, it has learned patterns specific to the training data that do not hold on new data. A single test set might not surface this clearly, particularly if the test set shares some of those same patterns with the training set. Cross validation makes overfitting harder to miss because the model is evaluated on multiple subsets. A model that has learned genuine patterns will perform consistently across folds. A model that has learned patterns specific to the training data may perform well on some folds and poorly on others.
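
Alongside fold-to-fold variation, a related check is to compare training scores with held-out scores in each fold. The sketch below uses scikit-learn's `cross_validate` with `return_train_score=True` on synthetic data. The unconstrained decision tree is an assumption chosen because it tends to memorize its training data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise (flip_y), standing in for real data.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)

# A tree with no depth limit is free to fit the training folds very closely.
tree = DecisionTreeClassifier(random_state=0)

results = cross_validate(tree, X, y, cv=5, return_train_score=True)
train_mean = results["train_score"].mean()
test_mean = results["test_score"].mean()

print(f"mean training accuracy: {train_mean:.3f}")
print(f"mean held-out accuracy: {test_mean:.3f}")
print(f"gap: {train_mean - test_mean:.3f}")
```

A large gap between the two means suggests the model has fit details of the training folds that do not carry over to the held-out folds.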

Regularization addresses overfitting by constraining what the model is allowed to learn during training. Cross validation helps measure whether that constraint is working. The two techniques are often used together: regularization reduces overfitting, and cross validation provides evidence that it has done so. Neither replaces the other. A model trained with regularization still needs to be evaluated, and cross validation is a more reliable way to do that evaluation than a single split.
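
As a rough illustration of the two working together, the sketch below cross-validates the same model at several regularization strengths. It assumes scikit-learn's `LogisticRegression`, where `C` is the inverse of regularization strength (smaller `C` means a stronger constraint). The dataset and the candidate values are arbitrary choices for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Few examples and many uninformative features: a setting prone to overfitting.
X, y = make_classification(
    n_samples=200, n_features=50, n_informative=5, random_state=0
)

mean_scores = {}
for C in (0.01, 0.1, 1.0, 10.0):
    # Scaling lives inside the pipeline so it is refit on each training fold.
    model = make_pipeline(StandardScaler(), LogisticRegression(C=C, max_iter=2000))
    scores = cross_val_score(model, X, y, cv=5)
    mean_scores[C] = scores.mean()
    print(f"C={C:<5}  mean accuracy={scores.mean():.3f}  std={scores.std():.3f}")
```

The cross-validated scores are the evidence for which level of constraint, if any, is helping on held-out data.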

When to Use Cross Validation and When Not To

Cross validation is most useful when datasets are limited and when a reliable estimate of performance is needed before deployment. It is standard practice during model selection, where you are choosing between different model types, and during hyperparameter tuning, where you are adjusting the settings that control how a model learns. In both cases, decisions are being made based on performance estimates, and those estimates need to be trustworthy.
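
For model selection, one reasonable setup is to score every candidate on the same folds so the comparison is like for like. The sketch below assumes scikit-learn. The two candidate models and the synthetic dataset are placeholders, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)

# A fixed random_state means every candidate is scored on identical folds.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

summary = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=cv)
    summary[name] = (scores.mean(), scores.std())
    print(f"{name}: mean={scores.mean():.3f}  std={scores.std():.3f}")
```

Both the mean and the spread matter here: two candidates with similar means can differ in how consistent they are from fold to fold.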

An e-commerce company selecting between several candidate models to predict which customers were likely to abandon their carts used cross validation to compare them. Each model was evaluated across ten folds. Two models that had looked similar on a single test split showed meaningfully different consistency across folds. The model with more consistent performance across folds was selected, and it held up better when deployed.

Cross validation adds less when datasets are very large. With millions of examples, a single well-constructed split usually gives a stable enough estimate, and it costs far less computation than k separate training runs.

Standard k-fold is also misleading when related examples are split across folds. If multiple records come from the same customer, patient, device, household, or transaction, the model may effectively see part of the same case during training and testing. That is not a real test. The folds need to be constructed so related examples stay together, or the evaluation will be more optimistic than it should be.
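
scikit-learn's `GroupKFold` is one way to keep related examples together. The sketch below uses a tiny made-up dataset of twelve records from four hypothetical customers and checks that no customer appears on both sides of any split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Twelve records from four hypothetical customers, three records each.
groups = np.repeat(["cust_a", "cust_b", "cust_c", "cust_d"], 3)
X = np.arange(len(groups)).reshape(-1, 1)  # placeholder features
y = np.tile([0, 1, 0], 4)                  # placeholder labels

splitter = GroupKFold(n_splits=4)

leaks = []
for train_idx, test_idx in splitter.split(X, y, groups=groups):
    train_customers = set(groups[train_idx])
    test_customers = set(groups[test_idx])
    # Record any customer that shows up in both training and test data.
    leaks.append(train_customers & test_customers)
    print("held-out customers:", sorted(test_customers))

# No customer is ever split across training and test sets.
assert all(len(overlap) == 0 for overlap in leaks)
```

The same splitter can be passed as the `cv` argument to `cross_val_score`, together with `groups=groups`.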

Time-ordered data has a similar problem. Standard k-fold cross validation ignores time order, which means the model can end up training on future data and testing on the past. If you are predicting future events, that is the wrong direction. Train on earlier data and test on later data, in that order.
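
scikit-learn's `TimeSeriesSplit` is one way to enforce that ordering. The sketch below assumes twelve observations that are already sorted from oldest to newest. The data itself is a placeholder.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve observations, assumed to be sorted from oldest (0) to newest (11).
X = np.arange(12).reshape(-1, 1)

splitter = TimeSeriesSplit(n_splits=3)

splits = []
for train_idx, test_idx in splitter.split(X):
    # Every training index comes before every test index.
    assert train_idx.max() < test_idx.min()
    splits.append((train_idx.tolist(), test_idx.tolist()))
    print(f"train={train_idx.tolist()}  test={test_idx.tolist()}")
```

Each round trains on a growing window of earlier observations and tests on the block that immediately follows it.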

Where Cross Validation Fits

Cross validation sits where model-building starts to become model evaluation. It does not change what the model learns. It changes how confidently you can say the model has learned something that will hold up on new data.

Used alongside regularization and careful data collection, cross validation in machine learning helps narrow the gap between a model that performs well on your data and one that performs well on data it has never seen.

This post is part of a series on why machine learning models fail in production and how to diagnose them. For more information:

What Is Overfitting in Machine Learning
What Regularization Does and Why Your Model Needs It
Bias vs. Variance: Why Your ML Model Can’t Have It All
