Most datasets don’t have two or three measurements per observation. They have dozens, sometimes hundreds. A patient record might include blood pressure, cholesterol, glucose levels, BMI, age, and dozens of lab results. A manufacturing sensor array might log temperature, vibration, pressure, humidity, and electrical output from every machine on the floor, every few seconds. The data is rich, but that richness creates a problem: when everything is measured, it gets hard to see anything.
That’s the problem Principal Component Analysis, or PCA, was built to solve.
When More Data Makes Machine Learning Harder
When a dataset has many features, say fifty or a hundred measurements per observation, we call it high-dimensional. The dimensionality is simply the number of features, and once you get past three, you can no longer visualize the data directly. What you can't see, you can't reason about easily.
The problem gets worse when those features are correlated, meaning they tend to move together. In a patient record, blood pressure and BMI often rise and fall in the same direction. Cholesterol and glucose levels might too. When features are correlated, they’re partially telling you the same thing, just from slightly different angles. You end up with a dataset that looks information-rich but contains a lot of repetition.
This repetition creates two practical problems. First, it’s hard to visualize. You can plot two variables against each other, maybe three if you’re creative, but beyond that you’re stuck. You can’t see the patterns. Second, many ML algorithms struggle with highly correlated features. They can get confused about which features are doing the real work, which makes models harder to train and harder to trust.
PCA addresses both problems by reorganizing the data. Instead of working with your original measurements, PCA combines them into a smaller set of scores that capture the most important differences across your observations. You go from many correlated features to a few independent ones, and you lose as little information as possible in the process. This is called dimensionality reduction, taking data that lives in many dimensions and representing it in fewer, while keeping the essential patterns intact.
What PCA Actually Does
PCA looks across all your measurements and asks: where are the biggest differences between observations? Not which features you happened to collect, but where the actual variation lives in the data.
Variation, also called variance, just means how much things differ from one another. If every patient in your dataset has nearly identical cholesterol levels, that measurement isn’t helping you tell patients apart. It has low variance, which means low information. The features that differ widely across your observations are the ones carrying the real signal. PCA finds where that signal is concentrated.
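You can check the "low variance means low information" idea directly. Here's a toy sketch, with made-up numbers standing in for real patient records:

```python
import numpy as np

# Toy data: each row is a patient, each column a measurement.
# Column 0 (cholesterol) is nearly identical across patients;
# column 1 (blood pressure) varies widely.
data = np.array([
    [199.9, 110.0],
    [200.1, 145.0],
    [200.0, 128.0],
    [199.8, 162.0],
])

variances = data.var(axis=0)
print(variances)
```

The near-constant column has almost zero variance: it can't help you tell one patient from another, no matter how precisely it was measured.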
Once PCA finds where the biggest differences are, it creates new measurements called principal components. Think of each component as a summary score — a single number that captures as much of the variation across your data as possible. The first component captures the most. The second captures the next most, without overlapping with the first. Each one is independent, so they’re not repeating information the way your original correlated features were.
The practical result is simpler, transformed data. Instead of describing each observation across fifty original features, you’re now describing it across a handful of components. And because the components are ordered by how much they capture, you can usually keep just the first few and still hold onto most of what matters in the data.
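In code, that whole transformation is a few lines. This sketch uses synthetic data built from three hidden factors, a stand-in for real measurements, so most of the variation genuinely lives in a few directions:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# 200 observations, 50 correlated features: built from 3 hidden
# factors plus a little noise, so the "real" structure is low-dimensional.
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 50))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 50))

# Standardize first: PCA is sensitive to feature scale.
X_scaled = StandardScaler().fit_transform(X)

# Keep however many components it takes to explain 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print(X.shape, "->", X_reduced.shape)
```

Fifty correlated columns go in; a handful of independent component scores come out, with most of the variation intact.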
The Math, Briefly
PCA works by finding the directions in your data along which the variation is greatest. Mathematically, these directions are the eigenvectors of the data's covariance matrix. The word sounds intimidating, but the idea is simple: an eigenvector here is just a direction through your data. PCA finds the direction where observations spread out the most, then the next direction of greatest spread that is independent of the first, and so on.
The amount of variation each direction captures is called its eigenvalue. A higher eigenvalue means that component is explaining more of the differences in your data.
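You can do this by hand with nothing but numpy: build the covariance matrix, take its eigendecomposition, and sort. The two-feature setup below is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two correlated features: the second is mostly a copy of the first.
x = rng.normal(size=300)
X = np.column_stack([x, 0.8 * x + 0.2 * rng.normal(size=300)])
X = X - X.mean(axis=0)  # PCA operates on centered data

# Covariance matrix, then its eigenvalues (amount of spread)
# and eigenvectors (the directions themselves).
cov = np.cov(X, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# eigh returns ascending order; flip so the largest eigenvalue comes first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

print(eigenvalues)
```

Because the two features are nearly redundant, the first eigenvalue dwarfs the second: almost all of the spread lives along a single direction.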
When you run PCA, you’ll see something called the explained variance ratio for each component. This tells you what percentage of the total variation in the data that component accounts for. If the first three components explain 87% of the variance, you can represent your data in three dimensions instead of fifty and still preserve 87% of the differences between observations. That’s usually a reasonable trade.
How many components to keep depends on the problem. A common approach is to plot how much each component contributes to explaining the differences across your data, then look for the point where the curve flattens out. That’s usually where meaningful signal ends and noise begins. Adding more components beyond that point gives you diminishing returns. In practice, the first few components often do the heavy lifting, and the rest are capturing noise more than signal. If you want to see how the math plays out visually, this StatQuest video is worth ten minutes of your time.
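In scikit-learn, those percentages live in `explained_variance_ratio_`, and a cumulative sum gives you the curve to inspect for the flattening point. The data here is synthetic, built so that two hidden factors dominate:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)

# 100 observations, 20 features driven by 2 strong hidden factors plus noise.
latent = rng.normal(size=(100, 2)) * [10.0, 5.0]
X = latent @ rng.normal(size=(2, 20)) + rng.normal(size=(100, 20))

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# How many components does it take to reach 90% of the variance?
n_keep = int(np.searchsorted(cumulative, 0.90) + 1)
print(pca.explained_variance_ratio_[:4].round(3))
print("components for 90% variance:", n_keep)
```

Plotting `pca.explained_variance_ratio_` against the component index gives the elbow curve described above; here the curve collapses after the second component, matching the two factors the data was built from.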
When to Use PCA in Machine Learning
PCA is worth reaching for in a few specific situations.
The most common is when you have too many features and they’re highly correlated. If you’re seeing slow model training, unstable results, or just a dataset that feels unwieldy, correlated features are often the reason. PCA cleans that up.
The second is when you want to visualize your data. You can’t plot fifty dimensions, but you can plot two or three principal components and see the shape of your data, whether observations cluster into groups, whether there are outliers, whether any obvious structure exists. This kind of exploratory work, looking at your data before you decide what to model, is where PCA is especially valuable. It helps you understand what you’re working with before you commit to a modeling approach.
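A minimal version of that workflow, using scikit-learn's bundled iris dataset (150 flowers, 4 measurements each) as a stand-in for your own data:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Four measurements per flower: too many axes to plot directly.
X, _ = load_iris(return_X_y=True)

# Project down to 2 principal components.
X_2d = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))

print(X_2d.shape)  # now plottable as an ordinary scatter plot
```

From here, a scatter plot of `X_2d[:, 0]` against `X_2d[:, 1]` shows the flowers separating into visible clusters by species, even though PCA never saw the species labels.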
The third is noise reduction. Real-world data is messy. Some of what’s in your features is signal, and some is just random fluctuation. Because PCA focuses on the directions of greatest variation, the later components, the ones that explain very little variance, tend to capture noise. Dropping them gives your model cleaner data to learn from.
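One way to see the denoising effect: project onto the leading components, then map back to the original feature space with `inverse_transform`. The setup below is invented for illustration, with clean structure in 2 directions buried under noise in 30:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)

# Clean signal living in 2 directions, observed through 30 noisy features.
clean = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 30))
noisy = clean + 0.5 * rng.normal(size=(100, 30))

# Keep only the top 2 components, then map back to all 30 features.
pca = PCA(n_components=2).fit(noisy)
denoised = pca.inverse_transform(pca.transform(noisy))

# The reconstruction sits closer to the clean signal than the raw input did.
err_noisy = np.mean((noisy - clean) ** 2)
err_denoised = np.mean((denoised - clean) ** 2)
print(err_noisy, ">", err_denoised)
```

Dropping the trailing components discards mostly noise, so the round trip through PCA recovers something closer to the underlying signal.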
One situation where PCA is less appropriate: when you need to explain exactly which original features are driving your results. Because principal components are combinations of your original features, they’re harder to interpret directly. If a stakeholder needs to know whether blood pressure specifically is influencing an outcome, working with PCA-transformed data makes that harder to answer.
PCA in Practice
Genomics researchers work with this problem constantly. A single gene expression study might measure activity levels across tens of thousands of genes for each patient. Running PCA on that data reduces it to a manageable number of components while preserving the genetic variation that actually differentiates the samples. Researchers can then plot patients in two or three dimensions and visually identify clusters, groups of patients whose gene expression profiles are similar. Those clusters often correspond to disease subtypes or treatment responses that would have been invisible in the original high-dimensional space.
The same logic applies in industrial settings. Sensor data from manufacturing equipment is almost always high-dimensional and highly correlated. Temperature, pressure, and vibration tend to rise and fall together when a machine is under stress. PCA compresses that data into a small number of components that reflect the underlying machine states, making it easier to spot anomalies that might indicate a developing fault.
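One common way to turn that into an anomaly check is reconstruction error: fit PCA on normal operation, and readings that don't fit the learned components reconstruct poorly. This is a hypothetical sketch, with sensor values and the fault reading invented for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)

# Normal operation: temperature, pressure, and vibration move together,
# driven by a single underlying machine state.
state = rng.normal(size=(500, 1))
normal = state @ np.array([[1.0, 0.8, 0.6]]) + 0.05 * rng.normal(size=(500, 3))

# A faulty reading where the sensors disagree: vibration spikes alone.
fault = np.array([[0.0, 0.0, 3.0]])

pca = PCA(n_components=1).fit(normal)

def reconstruction_error(samples):
    # Distance between each sample and its projection onto the learned component.
    restored = pca.inverse_transform(pca.transform(samples))
    return np.mean((samples - restored) ** 2, axis=1)

normal_err = reconstruction_error(normal).mean()
fault_err = reconstruction_error(fault)[0]
print(normal_err, "<<", fault_err)
```

Normal readings lie close to the single component PCA learned, so they reconstruct almost perfectly; the fault breaks the correlation pattern and lands far from it, which is exactly the signal a monitoring system would alert on.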
In both cases, PCA isn’t doing the final analysis. It’s preparing the data so that the next step has something clean to work with, whether that’s clustering observations into groups of similar cases, classifying them into predefined categories like “at risk” or “healthy,” or flagging unusual patterns that don’t fit the norm.
Where PCA Fits
PCA is unsupervised, meaning it doesn’t use labels or outcomes to do its work. Supervised learning, which we’ve covered in earlier posts, learns from labeled examples. PCA doesn’t need any of that. It looks only at the structure of the input data and finds the directions of greatest variation. That makes it useful any time you want to understand your data before you start predicting anything.
That’s also what connects PCA to the broader theme of this blog. The goal in ML isn’t to throw as many features as possible at an algorithm and hope something sticks. The goal is to find the patterns that matter and give your models something clean and meaningful to learn from. PCA is one of the most reliable tools for doing exactly that. Patterns first. The algorithms come after.
