
ISTQB CT-AI: Machine Learning Fundamentals for Testers
Machine learning powers most AI systems you'll encounter as a tester. Understanding how ML works - not at implementation depth, but at a conceptual level - enables you to design effective tests, communicate with ML teams, and identify potential failure modes.
This article covers CT-AI syllabus chapters 3-5: Machine Learning Overview, ML Data, and ML Functional Performance Metrics. You'll learn the different types of machine learning, how data quality affects model quality, and how to interpret the metrics used to evaluate ML systems.
Table of Contents
- Chapter 3: Machine Learning Overview
  - Types of Machine Learning
  - The ML Workflow
  - Overfitting and Underfitting
- Chapter 4: ML Data
  - Data Quality Dimensions
  - Data Preparation
  - Data Splitting Strategies
  - Data Drift
- Chapter 5: ML Performance Metrics
  - The Confusion Matrix
  - Classification Metrics
  - Regression Metrics
  - Benchmarking and Baselines
- Frequently Asked Questions
Chapter 3: Machine Learning Overview
Machine learning enables systems to learn patterns from data rather than following explicitly programmed rules. This fundamental shift changes how we think about testing: we're not just testing code, we're testing the combination of code, algorithms, and data.
What Makes ML Different
Traditional software:
- Behavior is explicitly coded
- Same input always produces same output
- Bugs are in the code
- Testing verifies code matches specification
Machine learning:
- Behavior emerges from data
- Outputs may vary (probabilistic)
- Problems can be in code, data, or their interaction
- Testing verifies the learned behavior meets requirements
This difference has profound implications for testing. You can't just verify that the code runs correctly - you need to verify that the model learned appropriate patterns.
Types of Machine Learning
ML systems are categorized by how they learn. Understanding the type helps you design appropriate tests.
Supervised Learning
In supervised learning, models learn from labeled examples. You provide inputs (features) and correct outputs (labels), and the model learns to map inputs to outputs.
Training process:
- Collect labeled data (input-output pairs)
- Model learns patterns connecting inputs to outputs
- Model can predict outputs for new inputs
Common applications:
Classification: Predicting categories
- Spam detection (spam vs not spam)
- Image recognition (cat vs dog vs bird)
- Medical diagnosis (disease present vs absent)
- Sentiment analysis (positive vs negative vs neutral)
Regression: Predicting continuous values
- House price prediction
- Demand forecasting
- Time-to-failure prediction
- Temperature prediction
Testing considerations for supervised learning:
- Quality depends heavily on label accuracy
- Model performance on training data may not reflect real-world performance
- Edge cases may have few or no labeled examples
- Label imbalance (many more of one class) affects learning
Exam Tip: Be able to identify whether a scenario involves classification (predicting categories) or regression (predicting numeric values). This affects which metrics apply.
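As a concrete illustration, the minimal sketch below (assuming scikit-learn is available; the feature values and labels are made up) fits a classifier for a category prediction and a regressor for a numeric prediction on the same toy inputs.

```python
# Minimal sketch, assuming scikit-learn; data values are illustrative only.
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical house features: [square_metres, bedrooms]
X = [[50, 1], [80, 2], [120, 3], [200, 4]]

# Classification: the label is a category.
y_class = ["cheap", "cheap", "expensive", "expensive"]
classifier = LogisticRegression().fit(X, y_class)
print(classifier.predict([[100, 2]]))  # predicts a category

# Regression: the label is a continuous value (price in thousands).
y_price = [150.0, 220.0, 340.0, 500.0]
regressor = LinearRegression().fit(X, y_price)
print(regressor.predict([[100, 2]]))  # predicts a number
```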
Unsupervised Learning
In unsupervised learning, models find patterns in data without labeled examples. You provide inputs only, and the model discovers structure.
Common approaches:
Clustering: Grouping similar items
- Customer segmentation
- Document categorization
- Anomaly detection (what doesn't fit clusters)
- Gene expression analysis
Dimensionality reduction: Simplifying data
- Feature compression
- Visualization of high-dimensional data
- Noise reduction
Association: Finding relationships
- Market basket analysis (items bought together)
- Pattern discovery
Testing considerations for unsupervised learning:
- No "correct" answer makes evaluation harder
- Cluster quality is often subjective
- Results may be unstable (different runs produce different clusters; see the sketch after this list)
- Business interpretation is needed to validate meaningfulness
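Because cluster assignments can change between runs, a simple stability check compares two runs directly. The sketch below is a minimal illustration assuming scikit-learn and NumPy; the data is synthetic.

```python
# Minimal sketch, assuming scikit-learn and NumPy; the data is synthetic.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))  # unlabeled data with no obvious cluster structure

# Run k-means twice with different random seeds and a single initialization each.
labels_a = KMeans(n_clusters=3, n_init=1, random_state=1).fit_predict(X)
labels_b = KMeans(n_clusters=3, n_init=1, random_state=2).fit_predict(X)

# Adjusted Rand index: 1.0 means identical clusterings; lower values mean instability.
print("run-to-run agreement:", adjusted_rand_score(labels_a, labels_b))
```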
Reinforcement Learning
In reinforcement learning, agents learn through trial and error, receiving rewards or penalties for actions.
Components:
- Agent: The learning system
- Environment: Where the agent operates
- Actions: What the agent can do
- Rewards: Feedback on action quality
- State: Current situation
Common applications:
- Game playing (chess, Go, video games)
- Robotics control
- Resource management
- Recommendation systems
Testing considerations for reinforcement learning:
- Emergent behavior is hard to predict
- Performance depends on reward function design
- Exploration vs exploitation trade-offs affect behavior
- Testing requires simulation environments
- Safety constraints need explicit testing
Semi-Supervised Learning
Semi-supervised learning uses a small amount of labeled data with a large amount of unlabeled data. This is practical when labeling is expensive but unlabeled data is plentiful.
Testing considerations:
- Label quality on the small labeled set is critical
- Model may propagate errors from mislabeled examples
- Performance on unlabeled data is hard to evaluate directly
Transfer Learning
Transfer learning uses a model trained on one task as a starting point for a different but related task. For example, a model trained on general image recognition might be fine-tuned for medical imaging.
Testing considerations:
- The source and target domains should be appropriately related
- Fine-tuning may not overcome fundamental mismatches
- Biases from source domain may transfer
The ML Workflow
Understanding the ML development workflow helps you identify where testing fits and what can go wrong at each stage.
1. Problem Definition
Before any data or modeling:
- Define the business problem
- Determine if ML is the right approach
- Specify success criteria
- Identify constraints (latency, interpretability, fairness)
Testing relevance: Clear problem definition enables meaningful test criteria. Vague objectives lead to undefined pass/fail criteria.
2. Data Collection
Gathering data for training:
- Identify data sources
- Collect representative samples
- Ensure sufficient volume
- Address legal and privacy requirements
Testing relevance: Data collection problems become model problems. Test data source reliability and representativeness.
3. Data Preparation
Transforming raw data for model consumption:
- Clean data (handle missing values, outliers, errors)
- Transform features (normalization, encoding)
- Engineer features (create derived variables)
- Select features (choose relevant inputs)
Testing relevance: Preparation pipelines can introduce bugs. Test that transformations work correctly and consistently.
4. Model Training
Learning patterns from prepared data:
- Choose algorithm(s) to try
- Train models on training data
- Tune hyperparameters
- Validate on held-out data
Testing relevance: Training configuration affects results. Test that training completes successfully and reproducibly.
5. Model Evaluation
Assessing model quality:
- Measure performance on test data
- Evaluate against baseline
- Check for overfitting
- Assess fairness and other quality characteristics
Testing relevance: This is where most explicit testing occurs. Comprehensive evaluation is essential before deployment.
6. Model Deployment
Moving model to production:
- Package model for deployment
- Integrate with applications
- Configure serving infrastructure
- Set up monitoring
Testing relevance: Deployment can break things that worked in development. Test end-to-end integration and performance.
7. Model Monitoring
Ongoing observation in production:
- Track performance metrics
- Detect data drift
- Monitor for bias emergence
- Alert on degradation
Testing relevance: Production testing continues after deployment. Monitoring is testing in production.
8. Model Maintenance
Keeping models current:
- Retrain on new data
- Update for changed requirements
- Address discovered issues
- Retire obsolete models
Testing relevance: Model updates need testing like code updates. Regression testing for ML includes performance comparison.
Overfitting and Underfitting
Two fundamental problems affect ML model quality.
Overfitting
Definition: The model learns training data too well, including noise and coincidental patterns that don't generalize.
Symptoms:
- Excellent performance on training data
- Poor performance on new data
- Model is overly complex
- Predictions are overconfident
Causes:
- Model is too complex for the data
- Training too long
- Insufficient training data
- Training data not representative
Testing implications:
- Compare performance on training vs test data
- Large gaps indicate overfitting (see the sketch after this list)
- Test on data from different times or sources
- Evaluate with cross-validation
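A minimal sketch of the train-versus-test comparison, assuming scikit-learn and using its bundled iris dataset; an unconstrained decision tree is chosen because it memorizes easily.

```python
# Minimal sketch, assuming scikit-learn; an unconstrained tree tends to memorize.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"train={train_acc:.2f} test={test_acc:.2f} gap={train_acc - test_acc:.2f}")
# A large gap between training and test accuracy is a warning sign of overfitting.
```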
Underfitting
Definition: The model is too simple to capture patterns in the data.
Symptoms:
- Poor performance on training data
- Poor performance on new data
- Model doesn't capture known patterns
- Predictions are too generic
Causes:
- Model is too simple
- Insufficient features
- Not enough training
- Poor feature engineering
Testing implications:
- Performance should exceed naive baselines
- Model should capture known patterns
- Feature importance should make sense
The Bias-Variance Trade-off
Overfitting and underfitting relate to a fundamental trade-off:
Bias: Error from overly simple assumptions (underfitting)
Variance: Error from sensitivity to training data fluctuations (overfitting)
The goal is finding appropriate complexity that minimizes total error.
Chapter 4: ML Data
Data quality determines model quality. "Garbage in, garbage out" applies doubly to ML: bad data not only produces bad outputs but teaches bad patterns.
Data Quality Dimensions
Accuracy
Data should correctly represent reality.
Problems:
- Measurement errors
- Transcription mistakes
- Outdated information
- Misrecorded values
Testing:
- Validate against known sources
- Check for impossible values
- Compare to physical constraints
- Audit sample records
Completeness
Data should include all relevant information.
Problems:
- Missing values
- Incomplete records
- Unrecorded events
- Sampling gaps
Testing:
- Count missing values
- Analyze missingness patterns
- Evaluate coverage of important segments
- Compare to expected data volumes
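The accuracy and completeness checks above can often be automated with a few lines of pandas. The sketch below assumes pandas; the column names and constraint values are illustrative.

```python
# Minimal sketch, assuming pandas; column names and limits are illustrative.
import pandas as pd

df = pd.DataFrame({
    "age": [34, None, 212, 45],            # 212 is physically impossible
    "weight_kg": [70.5, 82.0, None, -5.0], # negative weight is impossible
})

# Completeness: count missing values per column.
print(df.isna().sum())

# Accuracy: flag values that violate physical constraints.
print("impossible ages:", (df["age"] > 120).sum())
print("impossible weights:", (df["weight_kg"] <= 0).sum())
```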
Consistency
Data should be consistent within itself and with other sources.
Problems:
- Conflicting values for same entity
- Different formats for same type
- Inconsistent labeling
- Version conflicts
Testing:
- Cross-reference related records
- Check referential integrity
- Verify consistency with business rules
- Compare to authoritative sources
Timeliness
Data should reflect current reality.
Problems:
- Stale information
- Delayed updates
- Historical bias
- Concept drift
Testing:
- Check data timestamps
- Evaluate freshness requirements
- Test with recent data
- Monitor for temporal patterns
Relevance
Data should be appropriate for the problem.
Problems:
- Irrelevant features
- Proxy variables instead of direct measures
- Data from wrong population
- Features not available at prediction time
Testing:
- Validate feature relevance with domain experts
- Check that training features match deployment availability
- Evaluate if data represents target population
Data Preparation
Raw data rarely works directly for ML. Preparation transforms it into usable form.
Data Cleaning
Handling missing values:
- Remove records with missing values
- Impute missing values (mean, median, mode)
- Create "missing" indicator features
- Use algorithms that handle missing data
Testing: Verify cleaning doesn't introduce bias. Test that imputation strategies are appropriate.
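A minimal sketch of two common missing-value strategies, assuming pandas; the income figures are made up. A tester would compare the distribution before and after to check that imputation has not shifted it.

```python
# Minimal sketch, assuming pandas; the income values are illustrative.
import pandas as pd

df = pd.DataFrame({"income": [30_000, 45_000, None, 52_000, None, 61_000]})

# Option 1: drop records with missing values (may bias the remaining sample).
dropped = df.dropna()

# Option 2: impute with the median and keep an indicator of where imputation happened.
df["income_was_missing"] = df["income"].isna()
df["income"] = df["income"].fillna(df["income"].median())

print(df)
print("rows kept if dropping instead:", len(dropped))
```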
Handling outliers:
- Identify outliers (statistical methods, domain knowledge)
- Remove, cap, or transform outliers
- Investigate outlier causes
Testing: Verify outlier handling preserves legitimate edge cases.
Handling errors:
- Detect inconsistent or impossible values
- Correct fixable errors
- Remove unfixable corrupted records
Testing: Verify error detection catches known issues.
Feature Engineering
Creating useful inputs from raw data.
Transformations:
- Normalization (scaling to standard range)
- Encoding categorical variables
- Creating interaction features
- Binning continuous variables
Testing: Verify transformations are applied consistently between training and inference.
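One concrete consistency check: a scaler must be fitted on training data only and then reused unchanged at inference time. A minimal sketch assuming scikit-learn:

```python
# Minimal sketch, assuming scikit-learn; the values are illustrative.
from sklearn.preprocessing import StandardScaler

X_train = [[10.0], [20.0], [30.0]]
X_new = [[25.0]]  # data arriving at inference time

scaler = StandardScaler().fit(X_train)      # learn mean and std from training data only
X_train_scaled = scaler.transform(X_train)
X_new_scaled = scaler.transform(X_new)      # reuse the same fitted scaler, never refit

print(X_train_scaled.ravel(), X_new_scaled.ravel())
```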
Data Labeling
Creating ground truth for supervised learning.
Challenges:
- Labeling is expensive and time-consuming
- Human labelers make mistakes
- Ambiguous cases may have no correct label
- Labeler bias affects labels
Quality assurance:
- Multiple labelers for same data
- Inter-annotator agreement metrics
- Clear labeling guidelines
- Regular quality audits
Testing: Evaluate label quality, consistency, and potential bias.
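Inter-annotator agreement can be quantified with Cohen's kappa, which measures agreement between two labelers beyond what chance alone would produce. A minimal sketch assuming scikit-learn; the labels are made up.

```python
# Minimal sketch, assuming scikit-learn; the labels are illustrative.
from sklearn.metrics import cohen_kappa_score

labeler_a = ["spam", "spam", "ham", "ham", "spam", "ham"]
labeler_b = ["spam", "ham",  "ham", "ham", "spam", "ham"]

# 1.0 = perfect agreement, 0 = agreement no better than chance.
print("kappa:", cohen_kappa_score(labeler_a, labeler_b))
```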
Exam Tip: Questions about data often focus on how data problems manifest as model problems. Poor data quality leads to poor model quality, regardless of algorithm sophistication.
Data Splitting Strategies
ML development uses different data sets for different purposes.
Training Set
Used to train the model. The model learns patterns from this data.
Characteristics:
- Largest portion (typically 60-80%)
- Should be representative
- Model performance here doesn't indicate real-world performance
Validation Set
Used during development to tune hyperparameters and make design decisions.
Characteristics:
- Medium portion (typically 10-20%)
- Used to select between model variants
- Helps detect overfitting during development
Test Set
Used for final evaluation after development is complete.
Characteristics:
- Held out until final evaluation
- Never used for training or tuning
- Provides unbiased performance estimate
Why Splitting Matters
If you evaluate on training data, you measure memorization, not learning. The model might perform perfectly on data it's seen but fail on new data.
Common mistake: Using test data during development, then "evaluating" on that same test data. This gives optimistically biased estimates.
Best practice: Strict separation. Test data touches the model only once for final evaluation.
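A minimal sketch of a roughly 70/15/15 split, assuming scikit-learn and its bundled iris dataset; the test set is carved off first and not touched again until final evaluation.

```python
# Minimal sketch, assuming scikit-learn; split ratios are illustrative.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Step 1: hold out the test set first.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.15, random_state=42)

# Step 2: split the remainder into training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.15 / 0.85, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```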
Cross-Validation
When data is limited, cross-validation provides more reliable estimates.
K-fold cross-validation:
- Split data into K portions (folds)
- Train on K-1 folds, validate on 1 fold
- Repeat K times with different validation folds
- Average performance across folds
Benefits:
- Uses all data for training and validation
- Provides variance estimate
- Reduces luck in split selection
Testing relevance: Cross-validation performance is more reliable than single-split performance for limited data.
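A minimal 5-fold cross-validation sketch, assuming scikit-learn; the per-fold scores give both an average and a spread rather than a single lucky split.

```python
# Minimal sketch, assuming scikit-learn.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

print("per-fold accuracy:", scores)
print("mean:", scores.mean(), "std:", scores.std())
```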
Data Drift
Data changes over time, and these changes can degrade model performance.
Types of Drift
Concept drift: The relationship between inputs and outputs changes.
- Customer preferences evolve
- Fraud patterns change
- Medical best practices update
Data drift (covariate shift): Input distributions change while relationships stay the same.
- Demographics of users change
- Sensor calibration shifts
- Data collection methods change
Label drift: Output distributions change.
- Disease prevalence changes
- Product popularity shifts
Detecting Drift
Statistical tests: Compare distributions between training data and production data.
Performance monitoring: Track accuracy metrics over time. Declining performance suggests drift.
Prediction monitoring: Track prediction distribution changes. Unusual patterns suggest drift.
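As one example of a statistical test, a two-sample Kolmogorov-Smirnov test can compare a feature's training distribution with its recent production distribution. The sketch below assumes SciPy and NumPy; the data is synthetic and the alert threshold is arbitrary.

```python
# Minimal sketch, assuming SciPy and NumPy; data and threshold are illustrative.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
training_feature = rng.normal(loc=0.0, scale=1.0, size=1000)
production_feature = rng.normal(loc=0.4, scale=1.0, size=1000)  # shifted distribution

statistic, p_value = ks_2samp(training_feature, production_feature)
print(f"KS statistic={statistic:.3f}, p-value={p_value:.4f}")
if p_value < 0.01:
    print("Warning: possible drift in this feature")
```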
Addressing Drift
Model retraining: Update the model with recent data.
Sliding windows: Train on recent data only, discarding old data.
Ensemble methods: Combine models trained on different time periods.
Testing relevance: Test with data from different time periods. Monitor for drift in production.
Chapter 5: ML Performance Metrics
Metrics quantify model quality. Choosing appropriate metrics is crucial for meaningful evaluation.
The Confusion Matrix
For classification problems, the confusion matrix summarizes predictions:
                    Predicted Positive    Predicted Negative
Actual Positive     TP                    FN
Actual Negative     FP                    TN
True Positives (TP): Correctly predicted positive
True Negatives (TN): Correctly predicted negative
False Positives (FP): Incorrectly predicted positive (Type I error)
False Negatives (FN): Incorrectly predicted negative (Type II error)
All classification metrics derive from these four values.
Reading a Confusion Matrix
Example: Cancer screening model
                    Predicted Cancer         Predicted No Cancer       Row total
Actual Cancer       80                       20                        100 actual cancer
Actual No Cancer    30                       870                       900 actual no cancer
Column total        110 predicted cancer     890 predicted no cancer
- TP = 80: Correctly identified cancer
- TN = 870: Correctly identified no cancer
- FP = 30: Healthy people incorrectly flagged (unnecessary anxiety, testing)
- FN = 20: Cancer missed (potentially serious consequences)
Classification Metrics
Accuracy
Formula: (TP + TN) / (TP + TN + FP + FN)
Meaning: Proportion of all predictions that are correct.
Example: (80 + 870) / 1000 = 95%
When it's useful: Balanced classes where all errors are equally costly.
When it's misleading: Imbalanced classes. A model that predicts "no cancer" for everyone would achieve 90% accuracy in our example but miss all cancers.
Precision
Formula: TP / (TP + FP)
Meaning: Of all positive predictions, how many were correct?
Example: 80 / (80 + 30) = 72.7%
When it matters: When false positives are costly. For spam filtering, low precision means legitimate emails go to spam.
Recall (Sensitivity, True Positive Rate)
Formula: TP / (TP + FN)
Meaning: Of all actual positives, how many did we catch?
Example: 80 / (80 + 20) = 80%
When it matters: When false negatives are costly. For cancer screening, low recall means missing cancer cases.
F1 Score
Formula: 2 * (Precision * Recall) / (Precision + Recall)
Meaning: Harmonic mean of precision and recall. Balances both concerns.
Example: 2 * (0.727 * 0.80) / (0.727 + 0.80) = 76.2%
When it's useful: When you need a single metric that considers both false positives and false negatives.
Specificity (True Negative Rate)
Formula: TN / (TN + FP)
Meaning: Of all actual negatives, how many did we correctly identify?
Example: 870 / (870 + 30) = 96.7%
When it matters: When correctly identifying negatives is important.
Exam Tip: Practice calculating these metrics from confusion matrices. Questions often provide a matrix and ask for specific metrics.
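As practice, the calculations for the cancer-screening matrix above can be reproduced with plain Python:

```python
# Worked example using the cancer-screening confusion matrix above.
TP, FN, FP, TN = 80, 20, 30, 870

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)
specificity = TN / (TN + FP)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} "
      f"f1={f1:.3f} specificity={specificity:.3f}")
# accuracy=0.950 precision=0.727 recall=0.800 f1=0.762 specificity=0.967
```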
The Precision-Recall Trade-off
Precision and recall trade off against each other: you can usually increase one only at the expense of the other, typically by adjusting the prediction threshold. The sketch after the lists below illustrates this.
Lower threshold (more positive predictions):
- Catches more true positives (higher recall)
- Also produces more false positives (lower precision)
Higher threshold (fewer positive predictions):
- More confident predictions (higher precision)
- Misses more true positives (lower recall)
Choosing the trade-off depends on context:
- Cancer screening: Prioritize recall (don't miss cancer)
- Spam filtering: Balance depends on user tolerance
- Fraud detection: May prioritize precision (don't block legitimate transactions)
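The sketch below, assuming NumPy and using made-up predicted probabilities, sweeps the threshold to show precision rising while recall falls.

```python
# Minimal sketch, assuming NumPy; probabilities and labels are illustrative.
import numpy as np

probs = np.array([0.95, 0.90, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10])
truth = np.array([1,    1,    0,    1,    0,    1,    0,    0])  # 1 = actual positive

for threshold in (0.25, 0.50, 0.75):
    predicted = (probs >= threshold).astype(int)
    tp = int(((predicted == 1) & (truth == 1)).sum())
    fp = int(((predicted == 1) & (truth == 0)).sum())
    fn = int(((predicted == 0) & (truth == 1)).sum())
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    print(f"threshold={threshold:.2f} precision={precision:.2f} recall={recall:.2f}")
# As the threshold rises, precision goes up and recall goes down.
```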
ROC Curve and AUC
The ROC (Receiver Operating Characteristic) curve plots true positive rate against false positive rate across different thresholds.
AUC (Area Under the Curve): Single number summarizing ROC curve performance.
- AUC = 1.0: Perfect classification
- AUC = 0.5: Random guessing
- AUC < 0.5: Worse than random (model is inverted)
When it's useful: Comparing models without committing to a specific threshold.
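A minimal sketch assuming scikit-learn, reusing the hypothetical scores and labels from the threshold example above; AUC is computed directly from scores without picking a threshold.

```python
# Minimal sketch, assuming scikit-learn; scores and labels are illustrative.
from sklearn.metrics import roc_auc_score

truth = [1, 1, 0, 1, 0, 1, 0, 0]
scores = [0.95, 0.90, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]

print("AUC:", roc_auc_score(truth, scores))  # 1.0 = perfect, 0.5 = random guessing
```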
Regression Metrics
For continuous predictions, different metrics apply.
Mean Squared Error (MSE)
Formula: Average of (actual - predicted)^2
Meaning: Average squared difference between predictions and actual values.
Characteristics:
- Penalizes large errors more than small errors
- Units are squared (e.g., dollars^2)
- Lower is better
Root Mean Squared Error (RMSE)
Formula: Square root of MSE
Meaning: Same as MSE but in original units.
Characteristics:
- Easier to interpret than MSE
- Still penalizes large errors more
- Lower is better
Mean Absolute Error (MAE)
Formula: Average of |actual - predicted|
Meaning: Average absolute difference between predictions and actual values.
Characteristics:
- Linear penalty for errors
- Less sensitive to outliers than MSE/RMSE
- Lower is better
R-Squared (Coefficient of Determination)
Formula: 1 - (sum of squared residuals / total sum of squares)
Meaning: Proportion of variance explained by the model.
Characteristics:
- Typically ranges from 0 to 1 (can be negative for models that perform worse than predicting the mean)
- 1 = perfect predictions
- 0 = model explains nothing beyond mean
- Higher is better
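A worked example tying the four regression metrics together, assuming NumPy; the house-price values (in thousands) are made up.

```python
# Worked example, assuming NumPy; actual and predicted prices are illustrative.
import numpy as np

actual = np.array([200.0, 250.0, 300.0, 400.0])
predicted = np.array([210.0, 240.0, 320.0, 380.0])

errors = actual - predicted
mse = np.mean(errors ** 2)       # squared units
rmse = np.sqrt(mse)              # original units
mae = np.mean(np.abs(errors))    # linear penalty
r2 = 1 - np.sum(errors ** 2) / np.sum((actual - actual.mean()) ** 2)

print(f"MSE={mse:.1f} RMSE={rmse:.2f} MAE={mae:.1f} R2={r2:.3f}")
```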
Benchmarking and Baselines
Model performance is meaningful only in comparison to alternatives.
Types of Baselines
Random baseline: What would random predictions achieve?
Constant baseline: What if you always predicted the most common class or the mean value?
Simple rule baseline: What would simple business rules achieve?
Previous system baseline: How did the existing solution perform?
Human baseline: How well do human experts perform?
Why Baselines Matter
A model with 85% accuracy sounds good until you learn:
- The constant baseline achieves 90% (highly imbalanced data)
- Human experts achieve 98% (model is much worse)
- Random would achieve 50% (model is significantly better)
Always contextualize metrics with relevant baselines.
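A minimal sketch assuming scikit-learn: on synthetic, heavily imbalanced data, a "most frequent class" baseline puts the model's accuracy into context.

```python
# Minimal sketch, assuming scikit-learn; the data is synthetic and imbalanced.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("constant baseline accuracy:", baseline.score(X_test, y_test))
print("model accuracy:", model.score(X_test, y_test))
```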
Establishing Meaningful Benchmarks
Statistical benchmarks: Metrics that any reasonable model should beat.
Business benchmarks: Minimum performance for business viability.
Competitive benchmarks: Performance of alternative solutions.
Improvement benchmarks: Performance of previous model versions.
Testing should verify that models exceed relevant benchmarks.
Frequently Asked Questions
- Do I need to implement ML algorithms for the CT-AI exam?
- How do I calculate precision, recall, and F1 from a confusion matrix?
- What's the difference between overfitting and underfitting?
- Why do we split data into training, validation, and test sets?
- When should I use accuracy vs precision vs recall?
- What is data drift and why does it matter for testing?
- What data quality problems should testers look for?
- How do baselines help evaluate ML models?