A comparative study on how Principal Component Analysis — applied before and after normalization — affects the performance of a Decision Tree classifier on the Wisconsin Diagnostic Breast Cancer dataset.
Early and accurate detection of breast cancer is one of the most critical challenges in clinical medicine. In this project, I explore how dimensionality reduction via PCA (Principal Component Analysis) interacts with a Decision Tree classifier when diagnosing whether a tumor is Malignant (M) or Benign (B) — and whether normalizing features before PCA makes a meaningful difference.
[Figure: Decision Tree diagram for breast cancer classification, original data]
The Wisconsin Diagnostic Breast Cancer (WDBC) dataset has 30 numeric features computed from digitized images of fine needle aspirate (FNA) of breast masses. These features describe characteristics of cell nuclei such as radius, texture, perimeter, area, and smoothness. At 30 features, the dataset is a perfect candidate for exploring the effects of PCA.
“PCA doesn’t just reduce noise — it reveals the geometric skeleton of your data. But how many principal components you choose is an art as much as a science.”
The central question driving this experiment is: Does applying PCA improve, hurt, or keep model performance the same? And does normalizing the data before running PCA change the picture? I designed four experiments to answer these questions head-on.
The Four Experiments at a Glance
Model 01 — Decision Tree on Original Data: raw 30-feature dataset, no preprocessing beyond splitting.
Model 02 — Decision Tree on Normalized Data: StandardScaler applied; all features centered and scaled to unit variance.
Model 03 — Decision Tree with PCA on Original Data: PCA applied directly on raw data, reduced to 2 principal components.
Model 04 — Decision Tree with Normalize → PCA: StandardScaler first, then PCA reducing to 17 principal components.
02 — Data Source
The Dataset: Wisconsin Diagnostic Breast Cancer
The dataset comes from the UCI Machine Learning Repository (dataset ID 17), one of the most well-known and widely-used repositories for academic machine learning research. It was originally created at the University of Wisconsin–Madison.
In this project, the dataset is fetched programmatically using the ucimlrepo Python package, which removes the need to manually download files.
The target column is Diagnosis, with two categories: M (Malignant) — a cancerous tumor — and B (Benign) — a non-cancerous growth. This column is later encoded numerically: M = 1, B = 0.
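Below is a minimal sketch of the loading and encoding steps, assuming the ucimlrepo package and dataset ID 17 mentioned above (the variables X and y_num are reused in later sketches):

```python
import numpy as np
from ucimlrepo import fetch_ucirepo

# Fetch the Wisconsin Diagnostic Breast Cancer dataset by its UCI ID
wdbc = fetch_ucirepo(id=17)
X = wdbc.data.features          # pandas DataFrame with the 30 numeric features
y = wdbc.data.targets           # 'Diagnosis' column: 'M' or 'B'

# Encode the target numerically, as the project does: M = 1, B = 0
y_num = np.where(y['Diagnosis'] == 'M', 1, 0)
```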
The dataset is moderately imbalanced — about 63% Benign vs 37% Malignant — but not so severely that it would demand resampling techniques. The 30 features are organized into three groups of 10, each describing the mean, standard error, and worst (largest) value of a cell nucleus characteristic. These 10 characteristics are: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension.
?? Why This Dataset Is Perfect for PCA ??
With 30 features derived from just 10 physical measurements (mean, SE, worst), there is significant multicollinearity in the data — features like radius, perimeter, and area are mathematically correlated. PCA is designed to handle exactly this situation by finding directions of maximum variance that are uncorrelated with each other.
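As a quick sanity check of that multicollinearity, one can inspect the pairwise correlations of the size-related features. The column names below follow the UCI schema and are an assumption; adjust them to match the fetched DataFrame:

```python
# Size-related features are almost perfectly correlated, which is what PCA exploits
print(X[['radius1', 'perimeter1', 'area1']].corr().round(3))
```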
03 — Tools & Libraries
Libraries Used
The project relies entirely on the Python scientific computing ecosystem. Here is every library used and its role in the pipeline (the matching import block follows the list):

pandas (pd): DataFrame manipulation, merging features and labels, exploratory analysis.
numpy (np): numerical arrays, cumulative sums, np.where for label encoding.
scikit-learn: the core ML library (splitting, scaling, PCA, Decision Tree, GridSearchCV, metrics).
matplotlib (plt): base plotting (scree plots, bar charts, inline figures).
seaborn (sns): confusion matrix heatmaps, styled distribution plots.
missingno: visualizing missing-data patterns (no missing data found here).
ucimlrepo: programmatic access to UCI ML Repository datasets by ID.
sklearn.decomposition.PCA: Principal Component Analysis for dimensionality reduction.
sklearn.preprocessing.StandardScaler: feature normalization (zero mean, unit variance).
sklearn.tree.DecisionTreeClassifier: the base classifier used in all four experiments.
sklearn.model_selection.GridSearchCV: exhaustive hyperparameter search with 3-fold cross-validation.
sklearn.pipeline.Pipeline: wraps classifier steps to prevent data leakage during CV.
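For reference, a minimal import block matching the list above might look like this (a sketch, not the post's verbatim code):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
from ucimlrepo import fetch_ucirepo
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
```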
04 — Preprocessing
Normalization with StandardScaler
Before we get to PCA, we need to talk about StandardScaler — because normalization is not optional when working with PCA. It is arguably the most important preprocessing step in this pipeline.
What StandardScaler Does
StandardScaler from sklearn.preprocessing transforms each feature so that it has a mean of 0 and a standard deviation of 1. This is known as Z-score standardization. For each feature column, the transformation is:

z = (X − μ) / σ

where X is the original value, μ is the mean of the feature, and σ is the standard deviation.
After transformation, every feature will have μ = 0 and σ = 1 — they are all on the same scale, regardless of their original units (mm, mm², percentages, etc.).
PCA works by finding directions (principal components) in the data that explain the most variance. If your features have wildly different scales, PCA will be dominated by whichever feature has the largest numerical range — not because it’s more informative, but because it has larger numbers.
In this dataset, features like area (values in the hundreds or thousands) would completely overshadow features like smoothness (values between 0 and 0.2) if we don’t normalize first. The variance of area is orders of magnitude larger than the variance of smoothness, even though smoothness might carry equally important diagnostic information.
!! Critical Warning !!
StandardScaler should be fit on the training data only, then used to transform both the training and test sets. Fitting on the full dataset before splitting introduces data leakage — the model indirectly "sees" test-data statistics during training. In this project, fit_transform(X) is applied to the full dataset before splitting, which is technically data leakage — something to be mindful of in production settings. A leakage-free version is sketched below.
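For contrast, here is a leakage-free sketch, assuming a 70/30 split and a fixed random_state (both choices are mine, not necessarily the post's):

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(
    X, y_num, test_size=0.3, random_state=42)

scaler = StandardScaler().fit(X_train)   # statistics learned from training data only
X_train_s = scaler.transform(X_train)    # both sets transformed with those statistics
X_test_s = scaler.transform(X_test)
```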
Normalization and the Decision Tree: An Interesting Nuance
Here’s something worth noting: Decision Trees are inherently scale-invariant. Unlike algorithms like SVM or K-Nearest Neighbors, a Decision Tree splits features based on thresholds. Multiplying or shifting a feature doesn’t change where the optimal splits occur — it only changes the numerical value of the threshold, not the structure of the tree.
This explains one of the key findings in this project: Model 1 (original data) and Model 2 (normalized data) produce identical results. StandardScaler had no effect on the Decision Tree’s accuracy because the tree’s logic — which threshold splits the classes most cleanly — is unchanged by standardization.
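This is easy to verify empirically. Continuing the split from the sketch above, the same tree scores identically on raw and scaled features (a sketch; exact ties can in principle break differently in floating point):

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(random_state=0)
acc_raw = tree.fit(X_train, y_train).score(X_test, y_test)          # raw features
acc_scaled = tree.fit(X_train_s, y_train).score(X_test_s, y_test)   # scaled features
print(acc_raw, acc_scaled)   # the two accuracies match
```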
!! Key Insight !! Normalization matters for distance-based and gradient-based algorithms (SVM, KNN, logistic regression, neural networks). For tree-based algorithms (Decision Tree, Random Forest, XGBoost), it typically has no effect on accuracy. The real value of normalizing before PCA is what it does to PCA's component extraction — not what it does to the tree itself.
05 — Core Concept
Principal Component Analysis: Theory & Intuition
PCA (Principal Component Analysis) is a linear dimensionality reduction technique. Its goal is to transform a high-dimensional dataset into a smaller set of new variables called principal components, while preserving as much of the original information (variance) as possible.
Think of it this way: imagine your data as a cloud of points in 30-dimensional space. PCA rotates and reorients that cloud so that the axis with the greatest spread of data becomes the first new axis (PC1), the second greatest spread becomes PC2, and so on. You can then keep only the first few axes and still retain most of the story your data tells.
How PCA Works: Step by Step
1. Standardize the data (strongly recommended — this ensures all features contribute equally to the variance computation).
2. Compute the covariance matrix of the standardized features to understand how features vary relative to each other.
3. Compute eigenvectors and eigenvalues of the covariance matrix. Each eigenvector is a principal component; each eigenvalue tells us how much variance that component captures.
4. Sort by eigenvalue (descending). The eigenvector with the highest eigenvalue is PC1 — the direction of greatest variance.
5. Project the data onto the top k eigenvectors. Your 30-dimensional dataset is now k-dimensional, retaining most of the information. (A from-scratch numpy sketch of these steps follows.)
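For intuition, here is a from-scratch numpy sketch of those five steps, applied to the full feature matrix X purely for illustration:

```python
import numpy as np

Xs = (X.values - X.values.mean(axis=0)) / X.values.std(axis=0)  # 1. standardize
cov = np.cov(Xs, rowvar=False)                                  # 2. covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)                          # 3. eigendecomposition
order = np.argsort(eigvals)[::-1]                               # 4. sort descending
k = 2                                                           #    choose k components
W = eigvecs[:, order[:k]]                                       #    top-k eigenvectors
X_pca = Xs @ W                                                  # 5. project to k dims
```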
What PCA Preserves — and What It Loses
PCA preserves the global structure and variance of the data. The principal components are linear combinations of the original features, carefully constructed so that the first few components explain the bulk of the dataset’s variability.
What PCA loses is interpretability: the new features (PC1, PC2, …) are no longer “radius” or “area” — they are abstract mathematical combinations of all original features. You also lose some information: if you keep 2 components out of 30, you’re discarding the variance that lived in the other 28 directions. How much variance you keep depends entirely on how many components you choose.
!! A Critical PCA Rule !!
The principal components are guaranteed to be orthogonal (perpendicular) to each other. This means they are completely uncorrelated. One of PCA’s greatest benefits is transforming a set of correlated features (like radius, perimeter, area) into a new set of uncorrelated components — which is especially helpful for algorithms sensitive to multicollinearity.
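The uncorrelatedness is directly checkable: the covariance matrix of the projected data from the sketch above is diagonal up to floating-point error.

```python
# Off-diagonal entries are ~0: the projected components are uncorrelated
print(np.round(np.cov(X_pca, rowvar=False), 6))
```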
06 — PCA in Practice
Deciding the Number of Components
This is where PCA becomes a judgment call. There is no single universal answer to “how many principal components should I keep?” — and that is actually one of the most important things to understand about PCA in practice.
The standard approach is to use a scree plot or a cumulative explained variance curve, both of which this project generates from the explained_variance_ratio_ attribute of the fitted PCA object.
A common heuristic is to keep enough components to explain 95% or 99% of the variance. These thresholds are drawn as horizontal dashed lines on the cumulative variance plot. The point where the curve crosses the red 95% line tells you the minimum number of components needed to retain 95% of the information.
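A sketch of that plot, fitting PCA on the normalized data (the post describes the 95% line as red; the green 99% line is my choice):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_norm = StandardScaler().fit_transform(X)
cumvar = np.cumsum(PCA().fit(X_norm).explained_variance_ratio_)

plt.plot(range(1, len(cumvar) + 1), cumvar, marker='o')
plt.axhline(0.95, color='red', linestyle='--', label='95% variance')
plt.axhline(0.99, color='green', linestyle='--', label='99% variance')
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
plt.legend()
plt.show()
```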
[Figure: Scree plot showing number of components, PCA on normalized data]
For the normalized WDBC dataset, the plot reveals that:
Roughly 10 components capture ~95% of total variance.
Roughly 17 components capture ~99% of total variance.
The first 2 components alone capture a surprisingly large portion when applied to raw (non-normalized) data — because area and perimeter’s large values dominate.
[Figure: Scree plot showing number of components, PCA on original (raw) data]
?? The Subjectivity of Choosing k ??
There is no mathematically “correct” number of components. Choosing k is a trade-off between information retained and dimensionality reduced. In this project, two different choices were made: k=2 for PCA on raw data, and k=17 for PCA on normalized data. This subjectivity is intentional — it is part of the modeling process, and the scree plot is simply a visual tool to inform that decision. A practitioner with domain knowledge might make a different choice based on downstream model performance or computational constraints.
07 — Experimental Setup
The Four Models
Each of the four models uses the same base classifier — Decision Tree — with GridSearchCV for hyperparameter tuning and 3-fold cross-validation to prevent overfitting during tuning. The grid searches over depth, leaf size, split criterion, and more.
Model 1 — Decision Tree on Original Data
The baseline model. Raw 30-feature data is split directly into training and test sets, then fed into a Decision Tree wrapped in a scikit-learn Pipeline.
Model 2 — Decision Tree on Normalized Data
StandardScaler is applied to all features before splitting. The Decision Tree is then trained on the scaled features. As discussed in the StandardScaler section, this is expected to make no difference for a tree-based algorithm. (A code sketch follows the step diagram below.)
Model 02 — Normalized Data
Step 1: Raw 30-Feature Data
Step 2: StandardScaler → X_norm
Step 3: Train/Test Split 70/30
Step 4: GridSearchCV + Decision Tree
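A sketch of this pipeline. The parameter grid is an assumption (the post says it covers depth, leaf size, and split criteria but doesn't list exact values); placing the scaler inside the Pipeline keeps the CV folds leakage-free:

```python
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

param_grid = {
    'tree__max_depth': [3, 5, 7, None],
    'tree__min_samples_leaf': [1, 5, 10],
    'tree__criterion': ['gini', 'entropy'],
}
pipe2 = Pipeline([('scale', StandardScaler()),
                  ('tree', DecisionTreeClassifier(random_state=0))])

X_tr, X_te, y_tr, y_te = train_test_split(X, y_num, test_size=0.3, random_state=42)
model2 = GridSearchCV(pipe2, param_grid, cv=3).fit(X_tr, y_tr)
print(model2.best_params_, model2.score(X_te, y_te))
```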
Model 3 — Decision Tree on PCA (2 Components, Original Data)
PCA is applied directly to the raw, unnormalized data. Because features like area have enormous values compared to smoothness, the first principal component will be heavily dominated by high-scale features. Only 2 components are kept (a very aggressive reduction from 30 to 2), so there is significant information loss. (A code sketch follows the step diagram below.)
Model 03 — PCA on Original Data (k=2)
Step 1: Raw 30-Feature Data
Step 2: PCA(n=2) → PC1, PC2
Step 3: Train/Test Split 70/30
Step 4: GridSearchCV + Decision Tree
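A sketch of Model 3, reusing the imports from the Model 02 sketch and deliberately mirroring the post's order of operations (PCA fit on the full raw data before splitting, so the leakage caveat from the StandardScaler section applies here too):

```python
from sklearn.decomposition import PCA

X_pc2 = PCA(n_components=2).fit_transform(X)   # PCA on raw, unnormalized features
X_tr, X_te, y_tr, y_te = train_test_split(X_pc2, y_num, test_size=0.3, random_state=42)

model3 = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      {'max_depth': [3, 5, 7, None], 'min_samples_leaf': [1, 5, 10]},
                      cv=3).fit(X_tr, y_tr)
print(model3.score(X_te, y_te))
```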
Model 4 — Decision Tree on Normalize → PCA (17 Components)
The most theoretically sound pipeline. Normalization first ensures PCA distributes importance fairly across all features. 17 components are kept, retaining approximately 99% of the dataset's variance and reducing dimensionality by nearly half (from 30 to 17). (A code sketch follows the step diagram below.)
Model 04 — Normalize → PCA (k=17)
Step 1: Raw 30-Feature Data
Step 2: StandardScaler → X_norm
Step 3: PCA(n=17) → PC1–PC17
Step 4: Train/Test Split 70/30
Step 5: GridSearchCV + Decision Tree
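A sketch of Model 4 with both preprocessing steps inside the Pipeline, so the scaler and PCA are refit on training folds only during the 3-fold search (reusing param_grid from the Model 02 sketch):

```python
pipe4 = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA(n_components=17)),     # ~99% of variance per the scree plot
    ('tree', DecisionTreeClassifier(random_state=0)),
])

X_tr, X_te, y_tr, y_te = train_test_split(X, y_num, test_size=0.3, random_state=42)
model4 = GridSearchCV(pipe4, param_grid, cv=3).fit(X_tr, y_tr)
print(model4.best_params_, model4.score(X_te, y_te))
```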
Evaluation Metrics
All four models are evaluated using three standard classification metrics:
Metric definitions and their importance in the cancer context:

Accuracy = (TP + TN) / Total. Overall correctness: how often the model is right.
Precision = TP / (TP + FP). Of all predicted malignant cases, how many actually were?
Recall = TP / (TP + FN). Of all actual malignant cases, how many did the model catch? (Clinically critical — missing a malignant tumor is dangerous.)
In medical diagnostics, recall is often more important than accuracy or precision. A false negative (predicting Benign when it’s actually Malignant) can have life-threatening consequences, while a false positive leads to additional tests — unpleasant, but not lethal.
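Computing the three metrics on one model's held-out predictions is a one-liner each (a sketch using Model 4 from above; malignant is the positive class, encoded 1):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_pred = model4.predict(X_te)
print('accuracy :', accuracy_score(y_te, y_pred))
print('precision:', precision_score(y_te, y_pred))
print('recall   :', recall_score(y_te, y_pred))
```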
08 — Results
Model Performance Comparison
After running all four GridSearchCV experiments with 3-fold cross-validation and evaluating on the held-out test set, the results reveal a clear and informative pattern.
DT — Original Data: 93.6%
DT — Normalized Data: 93.6%
DT — PCA Original (k=2): ~86%
DT — Normalize + PCA (k=17): ~91%
Model 01 — Decision Tree, Original Data
Raw 30-feature dataset, no preprocessing beyond splitting.
Train Accuracy: ~100% | Test Accuracy: ~93.6% | Precision: ~93.6% | Recall: ~93.6%

Model 02 — Decision Tree, Normalized Data
StandardScaler applied; all features centered and scaled to unit variance.
Train Accuracy: ~100% | Test Accuracy: ~93.6% | Precision: ~93.6% | Recall: ~93.6%

Model 03 — Decision Tree, PCA on Original Data
PCA applied directly on raw data, reduced to 2 principal components.
Train Accuracy: ~88% | Test Accuracy: ~86% | Precision: ~86% | Recall: ~86%

Model 04 — Decision Tree, Normalize → PCA
StandardScaler first, then PCA reducing to 17 principal components (~99% variance).
Train Accuracy: ~96% | Test Accuracy: ~91% | Precision: ~91% | Recall: ~91%
Interpreting the Results
Finding 1: Models 1 and 2 are identical. As predicted, StandardScaler has no effect on Decision Tree performance. Both models find the same optimal splits with the same accuracy, precision, and recall. This is a direct confirmation that tree-based models are scale-invariant.
Finding 2: PCA on raw data (Model 3) significantly hurts performance. Reducing to just 2 components without normalizing first compounds two problems. First, those 2 components are biased toward high-variance, high-scale features. Second, two dimensions simply cannot capture the complexity of a 30-feature classification task. A drop to ~86% accuracy confirms this — the model lost too much discriminative information.
Finding 3: Normalize + PCA (Model 4) recovers most of the performance. By normalizing first and keeping 17 components (preserving ~99% variance), the model achieves ~91% test accuracy — significantly better than raw PCA, and only slightly below the full-feature models. This demonstrates that proper PCA application (normalize → then reduce) is a viable strategy that reduces dimensions from 30 to 17 with minimal information loss.
!! The Big Takeaway !!
The order of operations matters enormously in PCA pipelines. Normalize first, then PCA. Without normalization, PCA is biased and aggressive component reduction causes dramatic accuracy drops. With proper normalization, PCA can be an effective tool for creating leaner, faster models with only a modest accuracy trade-off — especially valuable when computational cost matters or when dealing with hundreds or thousands of features.
09 — Takeaways
Conclusion
This project set out to investigate how PCA — applied in different ways — changes the behavior of a Decision Tree classifier on a real medical dataset. The four experiments provide clear, interpretable answers.
✔ What Worked
Normalize → PCA (k=17) retained ~91% accuracy while reducing features from 30 to 17
GridSearchCV ensured fair hyperparameter comparison across all models
Scree plots clearly visualized where the variance “knee” occurs
The project confirmed theoretical expectations about scale-invariant classifiers
!! What to Watch Out For !!
PCA on unnormalized data (Model 3, k=2) caused a significant accuracy drop
Fitting StandardScaler on the full dataset before splitting is technically data leakage
Choosing k is subjective — different choices lead to different accuracy trade-offs
Decision Trees with full features tend to overfit training data (100% train accuracy)
Lessons Learned About PCA
PCA is a powerful tool, but it requires careful setup. The three most important lessons from this experiment are:
1. Always normalize before PCA. Without normalization, features with large numerical ranges dominate the principal components. The result is a biased reduction that may discard important information from small-scale features.
2. Choosing the number of components is a design decision, not a formula. The scree plot is your best tool — but where you draw the line between “enough” and “too many” depends on your use case. A 95% variance cutoff is common in practice; 99% is more conservative. In production, you’d typically validate your choice against downstream model performance.
3. PCA trades accuracy for efficiency. In this project, going from 30 features to 17 (with proper normalization) cost about 2–3% accuracy. Whether that trade-off is worth it depends on the application. For a cancer screening tool, that cost might be unacceptable. For a high-throughput pre-screening step, it might be fine.
“The best model is not always the most accurate one — it’s the one that balances accuracy, interpretability, computational cost, and the real-world stakes of its mistakes.”
Further Improvements
This experiment focused on Decision Trees, which are scale-invariant and not the ideal classifier to showcase PCA’s benefits. Future work could include:
Using a K-Nearest Neighbors (KNN) or Support Vector Machine (SVM) classifier, where normalization and PCA can more dramatically change results.
Using cross-validated grid search for k — treating the number of components as a hyperparameter (sketched after this list).
Applying proper train-only fitting of StandardScaler and PCA inside the Pipeline to eliminate data leakage.
Examining feature contributions to principal components to understand which original features matter most.
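The first two improvements combine naturally: with the scaler and PCA inside the Pipeline (as in the Model 4 sketch above), n_components becomes just another grid-search parameter. A sketch, with an assumed candidate list for k:

```python
param_grid_k = {
    'pca__n_components': [2, 5, 10, 17, 25],   # assumed candidates for k
    'tree__max_depth': [3, 5, 7, None],
}
model_k = GridSearchCV(pipe4, param_grid_k, cv=3).fit(X_tr, y_tr)
print(model_k.best_params_, model_k.score(X_te, y_te))
```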