A Friendly Introduction to Principal Component Analysis

5 min read

Most datasets don’t have two or three measurements per observation. They have dozens, sometimes hundreds. A patient record might include blood pressure, cholesterol, glucose levels, BMI, age, and dozens of lab results. A manufacturing sensor array might log temperature, vibration, pressure, humidity, and electrical output from every machine on the floor, every few seconds. The data is rich, but that richness creates a problem: when everything is measured, we often capture the same underlying signal in multiple ways, which adds noise and makes patterns harder to see. That’s the problem Principal Component Analysis, or PCA, was built to solve.

When More Data Makes Machine Learning Harder

When a dataset has many features, say fifty or a hundred measurements per observation, we call it high-dimensional. The dimensions are just the number of features, and once you get past three, humans can’t directly visualize the data, so we rely on mathematical techniques to summarize what’s happening.

The problem gets worse when those features are correlated, meaning they tend to move together. In a patient record, blood pressure and BMI often rise and fall in the same direction. For example, if BMI increases from 22 to 30 and blood pressure rises alongside it, those two measurements are partly telling the same story. Cholesterol and glucose levels often behave similarly. When features are correlated, they’re partially telling you the same thing, just from slightly different angles. The result is a dataset that looks information-rich but contains a lot of repetition.
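To make "features that move together" concrete, here is a minimal sketch with synthetic data. The variable names and the numbers are illustrative, not a real patient dataset; the point is just that the Pearson correlation comes out well above zero when one feature partly drives another.

```python
# Sketch of feature correlation on synthetic "patient" data.
import numpy as np

rng = np.random.default_rng(0)
bmi = rng.normal(27, 4, size=200)                         # synthetic BMI values
blood_pressure = 90 + 1.5 * bmi + rng.normal(0, 5, 200)   # rises with BMI, plus noise

# A Pearson correlation near 1 means the two features move together,
# i.e. they are partly telling the same story.
r = np.corrcoef(bmi, blood_pressure)[0, 1]
print(f"correlation: {r:.2f}")
```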

This repetition creates two practical problems. It makes the data hard to visualize. You can plot two variables against each other, maybe three if you’re creative, but beyond that you’re stuck. You can’t see the patterns. It also causes trouble for many ML algorithms. When features are essentially duplicates of each other, models can become unstable or assign inconsistent importance to them, making models harder to train and results harder to trust.

PCA addresses both problems by reorganizing the data. Instead of working with your original measurements, PCA combines them into a smaller set of component scores, new values that capture the most important differences across your observations. You go from many correlated features to a few independent ones while preserving as much useful information as possible. This is called dimensionality reduction, taking data that lives in many dimensions and representing it in fewer, while keeping the essential patterns intact.

What PCA Actually Does

PCA looks across all your measurements and asks: where are the biggest differences between observations? Not which features you happened to collect, but where the actual variation lives in the data.

Variation, also called variance, just means how much things differ from one another. If every patient in your dataset has nearly identical cholesterol levels, that measurement isn’t helping you tell patients apart. Because it doesn’t distinguish one patient from another, it provides very little useful information for finding patterns.

Once PCA finds where the biggest differences are, it creates new measurements called principal components. Think of each component as a summary score – a single number that captures as much of the variation across your data as possible. The first component captures the most. The second captures the next most, without overlapping with the first. Each one is independent, so they’re not repeating information the way your original correlated features were.

The practical result is simpler, transformed data. Instead of describing each observation across fifty original features, you’re now describing it across a handful of components. And because the components are ordered by how much they capture, you can usually keep just the first few and still hold onto most of the key information in the data.

The Math, Briefly

PCA works by finding the directions in your data along which the variation is greatest. You can think of these directions as lines that show the main trends in the data. Mathematically, they are called eigenvectors. The word sounds intimidating, but the idea is simple: an eigenvector is just a direction through your data. You don’t need to calculate these yourself. Software handles the math. What matters is understanding that PCA is finding the most informative trends in the data.

PCA finds the direction where observations spread out the most, then the next greatest spread that is independent of the first, and so on. The amount of variation each direction captures is called its eigenvalue. A higher eigenvalue means that its corresponding component explains more of the differences between observations. When you run PCA, you’ll see something called the explained variance ratio for each component. This tells you what percentage of the total variation in the data that component accounts for. If the first three components explain 87% of the variance, you can represent your data in three dimensions instead of fifty and still preserve most of the meaningful differences between observations. Running PCA gives you a new table of component scores, where each observation is described by its values on the principal components instead of the original features.
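The explained variance ratio is easy to inspect in practice. Below is a hedged sketch using scikit-learn on synthetic data that is deliberately built so a few hidden factors drive fifty correlated features; the shapes and seed are arbitrary choices for illustration.

```python
# Sketch: inspecting explained variance ratios with scikit-learn.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
factors = rng.normal(size=(200, 3))                        # 3 hidden signals
mixing = rng.normal(size=(3, 50))                          # spread across 50 features
X = factors @ mixing + 0.1 * rng.normal(size=(200, 50))    # plus a little noise

pca = PCA().fit(X)
ratios = pca.explained_variance_ratio_
print(ratios[:5].round(3))       # the first few components dominate
print(ratios[:3].sum())          # near 1: three components carry most of the variance
```

Because the data was built from three underlying factors, the first three components soak up nearly all the variation, which is exactly the pattern you hope to see on real correlated data.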

How many components to keep depends on the problem. A common approach is to create a scree plot, which shows how much each component contributes to explaining the variation in your data, and look for the point where the curve starts to level off. That’s usually where additional components add little new insight. Beyond that point, you get diminishing returns. The first few components often do the heavy lifting, and the rest capture more noise than signal. If you want to see how the math plays out visually, this StatQuest video is worth ten minutes of your time.
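A related rule of thumb is to keep enough components to reach a cumulative variance threshold. The sketch below uses made-up ratios and an arbitrary 90% cutoff purely for illustration.

```python
# Sketch: choosing k via a cumulative explained-variance threshold.
import numpy as np

# Pretend these came from pca.explained_variance_ratio_
ratios = np.array([0.45, 0.22, 0.12, 0.07, 0.05, 0.04, 0.03, 0.02])

cumulative = np.cumsum(ratios)
k = int(np.argmax(cumulative >= 0.90)) + 1   # first component count reaching 90%
print(k, cumulative[k - 1])
```

scikit-learn also supports this directly: passing a float like `PCA(n_components=0.90)` keeps however many components are needed to explain 90% of the variance.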

When to Use PCA in Machine Learning

PCA is worth reaching for in a few specific situations. The most common is when you have too many features and they’re highly correlated. If you’re seeing slow model training, unstable results, or just a dataset that feels unwieldy, correlated features are often the reason. PCA cleans that up.

The second is when you want to visualize your data. You can’t plot fifty dimensions, but you can plot two or three principal components and see the shape of your data, whether observations cluster into groups, whether there are outliers, whether any obvious structure exists. This kind of exploratory work, looking at your data before you decide what to model, is where PCA is especially valuable. It helps you understand what you’re working with before you commit to a modeling approach.
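Projecting to two components for plotting is a one-liner. The data below is a random stand-in; in practice you would scatter-plot the two columns of `coords` (with matplotlib, for example) and color the points by any known grouping.

```python
# Sketch: reducing many features to two plottable coordinates.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
X = rng.normal(size=(150, 30))               # 150 observations, 30 features

coords = PCA(n_components=2).fit_transform(X)
print(coords.shape)   # (150, 2): one (x, y) pair per observation, ready to plot
```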

The third is noise reduction. Real-world data is messy. Some of what’s in your features is signal, and some is just random fluctuation. Because PCA focuses on the directions of greatest variation, the later components, the ones that explain very little variance, tend to capture noise. Dropping them gives your model cleaner data to learn from.
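Denoising with PCA amounts to projecting onto the leading components and mapping back with `inverse_transform`. Here is a sketch on synthetic data where the true signal lives in two factors, so the reconstruction lands closer to the clean signal than the noisy input does.

```python
# Sketch of PCA denoising: keep leading components, reconstruct, compare.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
signal = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 20))   # 2 true factors
noisy = signal + 0.5 * rng.normal(size=(200, 20))               # add random noise

pca = PCA(n_components=2).fit(noisy)
denoised = pca.inverse_transform(pca.transform(noisy))

# The reconstruction should sit closer to the clean signal than the noisy data
mse_noisy = np.mean((noisy - signal) ** 2)
mse_denoised = np.mean((denoised - signal) ** 2)
print(mse_noisy, mse_denoised)
```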

One situation where PCA is less appropriate: when you need to explain exactly which original features are driving your results. Because principal components are combinations of your original features, they’re harder to interpret directly. If a stakeholder needs to know whether blood pressure specifically is influencing an outcome, working with PCA-transformed data makes that harder to answer.
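You can see why interpretation gets murky by looking at `components_`, which shows how each component mixes the original features. The feature names below are illustrative and the data is random; the point is that every feature contributes some weight to each component.

```python
# Sketch: each principal component is a weighted mix of all original features.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 4))
features = ["blood_pressure", "bmi", "cholesterol", "glucose"]

pca = PCA(n_components=2).fit(X)
for name, weight in zip(features, pca.components_[0]):
    print(f"{name}: {weight:+.2f}")   # every feature has a weight in PC1
```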

PCA in Practice

Genomics researchers work with this problem constantly. A single gene expression study might measure activity levels across tens of thousands of genes for each patient. Running PCA on that data reduces it to a manageable number of components while preserving the genetic variation that actually differentiates the samples. Researchers can then plot patients in two or three dimensions and visually identify clusters, groups of patients whose gene expression profiles are similar. Those clusters often correspond to disease subtypes or treatment responses that would have been invisible in the original high-dimensional space.

The same logic applies in industrial settings. Sensor data from manufacturing equipment is almost always high-dimensional and highly correlated. Temperature, pressure, and vibration tend to rise and fall together when a machine is under stress. PCA compresses that data into a small number of components that reflect the underlying machine states, making it easier to spot anomalies that might indicate a developing fault.
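One common way to operationalize this is reconstruction error: fit PCA on normal readings, then flag observations the low-dimensional model can’t rebuild. The sketch below uses synthetic stand-in data for sensor logs, and the 99th-percentile threshold is an arbitrary illustrative choice.

```python
# Sketch of anomaly detection via PCA reconstruction error.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(9)
factors = rng.normal(size=(500, 3))                          # hidden machine states
mixing = rng.normal(size=(3, 12))                            # 12 correlated sensors
normal = factors @ mixing + 0.1 * rng.normal(size=(500, 12))

pca = PCA(n_components=3).fit(normal)

def reconstruction_error(pca, X):
    rebuilt = pca.inverse_transform(pca.transform(X))
    return np.mean((X - rebuilt) ** 2, axis=1)

# Threshold from the normal data; points far above it break the learned correlations
threshold = np.percentile(reconstruction_error(pca, normal), 99)
anomaly = rng.normal(size=(1, 12)) * 5
print(reconstruction_error(pca, anomaly)[0] > threshold)
```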

In both cases, PCA isn’t doing the final analysis. It’s preparing the data so that the next step has something clean to work with, whether that’s clustering observations into groups of similar cases, classifying them into predefined categories like “at risk” or “healthy,” or flagging unusual patterns that don’t fit the norm.

Where PCA Fits

PCA is unsupervised, meaning it doesn’t use labels or outcomes to do its work. Supervised learning, which we’ve covered in earlier posts, learns from labeled examples. PCA doesn’t need any of that. It looks only at the structure of the input data and finds the directions of greatest variation. That makes it useful any time you want to understand your data before you start predicting anything.

That’s what connects PCA to the broader theme of this blog. The goal in ML isn’t to throw as many features as possible at an algorithm and hope something sticks. It’s to understand the structure of your data first, and PCA is one of the most reliable tools for doing that. Patterns first. The algorithms come after.
