
ISTQB CT-AI: Machine Learning Fundamentals for Testers

Parul Dhingra - Senior Quality Analyst

Updated: 1/25/2026

Machine learning powers most AI systems you'll encounter as a tester. Understanding how ML works - not at implementation depth, but at a conceptual level - enables you to design effective tests, communicate with ML teams, and identify potential failure modes.

This article covers CT-AI syllabus chapters 3-5: Machine Learning Overview, ML Data, and ML Functional Performance Metrics. You'll learn the different types of machine learning, how data quality affects model quality, and how to interpret the metrics used to evaluate ML systems.

Chapter 3: Machine Learning Overview

Machine learning enables systems to learn patterns from data rather than following explicitly programmed rules. This fundamental shift changes how we think about testing: we're not just testing code, we're testing the combination of code, algorithms, and data.

What Makes ML Different

Traditional software:

  • Behavior is explicitly coded
  • Same input always produces same output
  • Bugs are in the code
  • Testing verifies code matches specification

Machine learning:

  • Behavior emerges from data
  • Outputs may vary (probabilistic)
  • Problems can be in code, data, or their interaction
  • Testing verifies the learned behavior meets requirements

This difference has profound implications for testing. You can't just verify that the code runs correctly - you need to verify that the model learned appropriate patterns.

Types of Machine Learning

ML systems are categorized by how they learn. Understanding the type helps you design appropriate tests.

Supervised Learning

In supervised learning, models learn from labeled examples. You provide inputs (features) and correct outputs (labels), and the model learns to map inputs to outputs.

Training process:

  1. Collect labeled data (input-output pairs)
  2. Model learns patterns connecting inputs to outputs
  3. Model can predict outputs for new inputs

Common applications:

Classification: Predicting categories

  • Spam detection (spam vs not spam)
  • Image recognition (cat vs dog vs bird)
  • Medical diagnosis (disease present vs absent)
  • Sentiment analysis (positive vs negative vs neutral)

Regression: Predicting continuous values

  • House price prediction
  • Demand forecasting
  • Time-to-failure prediction
  • Temperature prediction

Testing considerations for supervised learning:

  • Quality depends heavily on label accuracy
  • Model performance on training data may not reflect real-world performance
  • Edge cases may have few or no labeled examples
  • Label imbalance (many more of one class) affects learning

Exam Tip: Be able to identify whether a scenario involves classification (predicting categories) or regression (predicting numeric values). This affects which metrics apply.
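
To make the distinction concrete, here is a minimal scikit-learn sketch that trains one classifier and one regressor on synthetic data. The data, feature counts, and model choices are illustrative assumptions, not syllabus content.

  # Classification vs regression: same workflow, different output type.
  import numpy as np
  from sklearn.linear_model import LinearRegression, LogisticRegression

  rng = np.random.default_rng(42)

  # Classification: predict a category (e.g. spam = 1, not spam = 0).
  X_cls = rng.normal(size=(200, 3))                      # three numeric features
  y_cls = (X_cls[:, 0] + X_cls[:, 1] > 0).astype(int)    # binary label
  clf = LogisticRegression().fit(X_cls, y_cls)
  print("predicted class:", clf.predict(X_cls[:1]))      # a discrete category

  # Regression: predict a continuous value (e.g. a price).
  X_reg = rng.normal(size=(200, 3))
  y_reg = 50 * X_reg[:, 0] + 10 * X_reg[:, 1] + rng.normal(size=200)
  reg = LinearRegression().fit(X_reg, y_reg)
  print("predicted value:", reg.predict(X_reg[:1]))      # a number on a continuous scale

If a scenario describes predicting a category, classification metrics (precision, recall, F1) apply; if it describes predicting a number, regression metrics (MSE, MAE, R-squared) apply.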

Unsupervised Learning

In unsupervised learning, models find patterns in data without labeled examples. You provide inputs only, and the model discovers structure.

Common approaches:

Clustering: Grouping similar items

  • Customer segmentation
  • Document categorization
  • Anomaly detection (what doesn't fit clusters)
  • Gene expression analysis

Dimensionality reduction: Simplifying data

  • Feature compression
  • Visualization of high-dimensional data
  • Noise reduction

Association: Finding relationships

  • Market basket analysis (items bought together)
  • Pattern discovery

Testing considerations for unsupervised learning:

  • No "correct" answer makes evaluation harder
  • Cluster quality is often subjective
  • Results may be unstable (different runs produce different clusters)
  • Business interpretation is needed to validate meaningfulness

Reinforcement Learning

In reinforcement learning, agents learn through trial and error, receiving rewards or penalties for actions.

Components:

  • Agent: The learning system
  • Environment: Where the agent operates
  • Actions: What the agent can do
  • Rewards: Feedback on action quality
  • State: Current situation

Common applications:

  • Game playing (chess, Go, video games)
  • Robotics control
  • Resource management
  • Recommendation systems

Testing considerations for reinforcement learning:

  • Emergent behavior is hard to predict
  • Performance depends on reward function design
  • Exploration vs exploitation trade-offs affect behavior
  • Testing requires simulation environments
  • Safety constraints need explicit testing

Semi-Supervised Learning

Semi-supervised learning uses a small amount of labeled data with a large amount of unlabeled data. This is practical when labeling is expensive but unlabeled data is plentiful.

Testing considerations:

  • Label quality on the small labeled set is critical
  • Model may propagate errors from mislabeled examples
  • Performance on unlabeled data is hard to evaluate directly

Transfer Learning

Transfer learning uses a model trained on one task as a starting point for a different but related task. For example, a model trained on general image recognition might be fine-tuned for medical imaging.

Testing considerations:

  • The source and target domains should be appropriately related
  • Fine-tuning may not overcome fundamental mismatches
  • Biases from source domain may transfer

The ML Workflow

Understanding the ML development workflow helps you identify where testing fits and what can go wrong at each stage.

1. Problem Definition

Before any data or modeling:

  • Define the business problem
  • Determine if ML is the right approach
  • Specify success criteria
  • Identify constraints (latency, interpretability, fairness)

Testing relevance: Clear problem definition enables meaningful test criteria. Vague objectives lead to undefined pass/fail criteria.

2. Data Collection

Gathering data for training:

  • Identify data sources
  • Collect representative samples
  • Ensure sufficient volume
  • Address legal and privacy requirements

Testing relevance: Data collection problems become model problems. Test data source reliability and representativeness.

3. Data Preparation

Transforming raw data for model consumption:

  • Clean data (handle missing values, outliers, errors)
  • Transform features (normalization, encoding)
  • Engineer features (create derived variables)
  • Select features (choose relevant inputs)

Testing relevance: Preparation pipelines can introduce bugs. Test that transformations work correctly and consistently.

4. Model Training

Learning patterns from prepared data:

  • Choose algorithm(s) to try
  • Train models on training data
  • Tune hyperparameters
  • Validate on held-out data

Testing relevance: Training configuration affects results. Test that training completes successfully and reproducibly.

5. Model Evaluation

Assessing model quality:

  • Measure performance on test data
  • Evaluate against baseline
  • Check for overfitting
  • Assess fairness and other quality characteristics

Testing relevance: This is where most explicit testing occurs. Comprehensive evaluation is essential before deployment.

6. Model Deployment

Moving model to production:

  • Package model for deployment
  • Integrate with applications
  • Configure serving infrastructure
  • Set up monitoring

Testing relevance: Deployment can break things that worked in development. Test end-to-end integration and performance.

7. Model Monitoring

Ongoing observation in production:

  • Track performance metrics
  • Detect data drift
  • Monitor for bias emergence
  • Alert on degradation

Testing relevance: Production testing continues after deployment. Monitoring is testing in production.

8. Model Maintenance

Keeping models current:

  • Retrain on new data
  • Update for changed requirements
  • Address discovered issues
  • Retire obsolete models

Testing relevance: Model updates need testing like code updates. Regression testing for ML includes performance comparison.

Overfitting and Underfitting

Two fundamental problems affect ML model quality.

Overfitting

Definition: The model learns training data too well, including noise and coincidental patterns that don't generalize.

Symptoms:

  • Excellent performance on training data
  • Poor performance on new data
  • Model is overly complex
  • Predictions are overconfident

Causes:

  • Model is too complex for the data
  • Training too long
  • Insufficient training data
  • Training data not representative

Testing implications:

  • Compare performance on training vs test data
  • Large gaps indicate overfitting
  • Test on data from different times or sources
  • Evaluate with cross-validation
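
The first two checks in the list above can be automated. Here is a minimal scikit-learn sketch that deliberately trains a flexible model and compares training accuracy to test accuracy; the 10-percentage-point gap used as an alert threshold is an illustrative assumption, not a syllabus value.

  # Flag possible overfitting by comparing train vs test accuracy.
  from sklearn.datasets import make_classification
  from sklearn.model_selection import train_test_split
  from sklearn.tree import DecisionTreeClassifier

  X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

  model = DecisionTreeClassifier(random_state=0)   # unconstrained depth: prone to overfit
  model.fit(X_train, y_train)

  train_acc = model.score(X_train, y_train)
  test_acc = model.score(X_test, y_test)
  gap = train_acc - test_acc

  print(f"train={train_acc:.2f} test={test_acc:.2f} gap={gap:.2f}")
  if gap > 0.10:                                   # assumed alert threshold
      print("Large train/test gap - investigate overfitting")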

Underfitting

Definition: The model is too simple to capture patterns in the data.

Symptoms:

  • Poor performance on training data
  • Poor performance on new data
  • Model doesn't capture known patterns
  • Predictions are too generic

Causes:

  • Model is too simple
  • Insufficient features
  • Not enough training
  • Poor feature engineering

Testing implications:

  • Performance should exceed naive baselines
  • Model should capture known patterns
  • Feature importance should make sense

The Bias-Variance Trade-off

Overfitting and underfitting relate to a fundamental trade-off:

Bias: Error from overly simple assumptions (underfitting)

Variance: Error from sensitivity to training data fluctuations (overfitting)

The goal is finding appropriate complexity that minimizes total error.

Chapter 4: ML Data

Data quality determines model quality. "Garbage in, garbage out" applies doubly to ML: bad data not only produces bad outputs but teaches bad patterns.

Data Quality Dimensions

Accuracy

Data should correctly represent reality.

Problems:

  • Measurement errors
  • Transcription mistakes
  • Outdated information
  • Misrecorded values

Testing:

  • Validate against known sources
  • Check for impossible values
  • Compare to physical constraints
  • Audit sample records

Completeness

Data should include all relevant information.

Problems:

  • Missing values
  • Incomplete records
  • Unrecorded events
  • Sampling gaps

Testing:

  • Count missing values
  • Analyze missingness patterns
  • Evaluate coverage of important segments
  • Compare to expected data volumes
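
Several of these completeness checks can be scripted with pandas. In the sketch below, the file name, the "region" column, and the expected row count are hypothetical placeholders for your own data contract.

  import pandas as pd

  df = pd.read_csv("training_data.csv")                  # hypothetical file

  # Missing-value rate per column
  print(df.isna().mean().sort_values(ascending=False))

  # Coverage of an important segment (every region should appear)
  print(df["region"].value_counts(dropna=False))         # hypothetical column

  # Compare to expected data volume from the data contract
  expected_rows = 50_000                                 # assumed figure
  assert len(df) >= 0.95 * expected_rows, "Fewer rows than expected - possible collection gap"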

Consistency

Data should be consistent within itself and with other sources.

Problems:

  • Conflicting values for same entity
  • Different formats for same type
  • Inconsistent labeling
  • Version conflicts

Testing:

  • Cross-reference related records
  • Check referential integrity
  • Verify consistency with business rules
  • Compare to authoritative sources

Timeliness

Data should reflect current reality.

Problems:

  • Stale information
  • Delayed updates
  • Historical bias
  • Concept drift

Testing:

  • Check data timestamps
  • Evaluate freshness requirements
  • Test with recent data
  • Monitor for temporal patterns

Relevance

Data should be appropriate for the problem.

Problems:

  • Irrelevant features
  • Proxy variables instead of direct measures
  • Data from wrong population
  • Features not available at prediction time

Testing:

  • Validate feature relevance with domain experts
  • Check that training features match deployment availability
  • Evaluate if data represents target population

Data Preparation

Raw data rarely works directly for ML. Preparation transforms it into usable form.

Data Cleaning

Handling missing values:

  • Remove records with missing values
  • Impute missing values (mean, median, mode)
  • Create "missing" indicator features
  • Use algorithms that handle missing data

Testing: Verify cleaning doesn't introduce bias. Test that imputation strategies are appropriate.
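
As a sketch of that kind of check, the pandas snippet below imputes a numeric column with its median, keeps a "missing" indicator feature, and compares the distribution before and after. The "income" column name and the 5% tolerance are assumptions for illustration.

  import pandas as pd

  df = pd.read_csv("training_data.csv")                  # hypothetical file

  median_before = df["income"].median()                  # median of observed values
  df["income_missing"] = df["income"].isna()             # indicator feature
  df["income"] = df["income"].fillna(median_before)

  observed = df.loc[~df["income_missing"], "income"]     # rows that were never missing
  shift = abs(df["income"].mean() - observed.mean())
  tolerance = 0.05 * observed.std()                      # assumed tolerance
  if shift > tolerance:
      print("Imputation noticeably shifted the income distribution - review strategy")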

Handling outliers:

  • Identify outliers (statistical methods, domain knowledge)
  • Remove, cap, or transform outliers
  • Investigate outlier causes

Testing: Verify outlier handling preserves legitimate edge cases.

Handling errors:

  • Detect inconsistent or impossible values
  • Correct fixable errors
  • Remove unfixable corrupted records

Testing: Verify error detection catches known issues.

Feature Engineering

Creating useful inputs from raw data.

Transformations:

  • Normalization (scaling to standard range)
  • Encoding categorical variables
  • Creating interaction features
  • Binning continuous variables

Testing: Verify transformations are applied consistently between training and inference.
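
One common way to enforce that consistency is to bundle the transformation and the model together, so the scaler fitted on training data is automatically reused at inference. A minimal scikit-learn sketch, with synthetic data as a stand-in:

  from sklearn.datasets import make_classification
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import train_test_split
  from sklearn.pipeline import Pipeline
  from sklearn.preprocessing import StandardScaler

  X, y = make_classification(n_samples=500, n_features=10, random_state=0)
  X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

  pipe = Pipeline([
      ("scale", StandardScaler()),      # fitted on training data only
      ("model", LogisticRegression()),
  ])
  pipe.fit(X_train, y_train)

  # At inference the already-fitted scaler is applied automatically,
  # so there is no train/serve skew from re-fitting on new data.
  print("test accuracy:", pipe.score(X_test, y_test))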

Data Labeling

Creating ground truth for supervised learning.

Challenges:

  • Labeling is expensive and time-consuming
  • Human labelers make mistakes
  • Ambiguous cases may have no correct label
  • Labeler bias affects labels

Quality assurance:

  • Multiple labelers for same data
  • Inter-annotator agreement metrics
  • Clear labeling guidelines
  • Regular quality audits

Testing: Evaluate label quality, consistency, and potential bias.
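
Inter-annotator agreement can be quantified with Cohen's kappa, for example via scikit-learn. The label lists and the 0.6 "acceptable" threshold below are illustrative assumptions.

  from sklearn.metrics import cohen_kappa_score

  annotator_a = ["spam", "ham", "spam", "ham", "spam", "ham", "ham", "spam"]
  annotator_b = ["spam", "ham", "ham", "ham", "spam", "ham", "spam", "spam"]

  kappa = cohen_kappa_score(annotator_a, annotator_b)
  print(f"Cohen's kappa: {kappa:.2f}")   # 1.0 = perfect agreement, 0 = chance level
  if kappa < 0.6:                        # assumed threshold
      print("Agreement is weak - review the labeling guidelines before training")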

Exam Tip: Questions about data often focus on how data problems manifest as model problems. Poor data quality leads to poor model quality, regardless of algorithm sophistication.

Data Splitting Strategies

ML development uses different data sets for different purposes.

Training Set

Used to train the model. The model learns patterns from this data.

Characteristics:

  • Largest portion (typically 60-80%)
  • Should be representative
  • Model performance here doesn't indicate real-world performance

Validation Set

Used during development to tune hyperparameters and make design decisions.

Characteristics:

  • Medium portion (typically 10-20%)
  • Used to select between model variants
  • Helps detect overfitting during development

Test Set

Used for final evaluation after development is complete.

Characteristics:

  • Held out until final evaluation
  • Never used for training or tuning
  • Provides unbiased performance estimate

Why Splitting Matters

If you evaluate on training data, you measure memorization, not learning. The model might perform perfectly on data it's seen but fail on new data.

Common mistake: Using test data during development, then "evaluating" on that same test data. This gives optimistically biased estimates.

Best practice: Strict separation. Test data touches the model only once for final evaluation.
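
A minimal scikit-learn sketch of a 70/15/15 split; the ratios are illustrative, and stratify keeps the class balance similar across the three sets.

  from sklearn.datasets import make_classification
  from sklearn.model_selection import train_test_split

  X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

  # First carve off 30%, then split that half-and-half into validation and test.
  X_train, X_tmp, y_train, y_tmp = train_test_split(
      X, y, test_size=0.30, stratify=y, random_state=0)
  X_val, X_test, y_val, y_test = train_test_split(
      X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=0)

  print(len(X_train), len(X_val), len(X_test))   # 700 150 150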

Cross-Validation

When data is limited, cross-validation provides more reliable estimates.

K-fold cross-validation:

  1. Split data into K portions (folds)
  2. Train on K-1 folds, validate on 1 fold
  3. Repeat K times with different validation folds
  4. Average performance across folds

Benefits:

  • Uses all data for training and validation
  • Provides variance estimate
  • Reduces luck in split selection

Testing relevance: Cross-validation performance is more reliable than single-split performance for limited data.
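
A minimal sketch of 5-fold cross-validation with scikit-learn on synthetic data; reporting the mean and standard deviation shows both expected performance and how much it depends on the particular split.

  from sklearn.datasets import make_classification
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import cross_val_score

  X, y = make_classification(n_samples=500, random_state=0)
  scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5, scoring="accuracy")

  print("fold accuracies:", scores.round(3))
  print(f"mean={scores.mean():.3f} std={scores.std():.3f}")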

Data Drift

Data changes over time, and these changes can degrade model performance.

Types of Drift

Concept drift: The relationship between inputs and outputs changes.

  • Customer preferences evolve
  • Fraud patterns change
  • Medical best practices update

Data drift (covariate shift): Input distributions change while relationships stay the same.

  • Demographics of users change
  • Sensor calibration shifts
  • Data collection methods change

Label drift: Output distributions change.

  • Disease prevalence changes
  • Product popularity shifts

Detecting Drift

Statistical tests: Compare distributions between training data and production data.

Performance monitoring: Track accuracy metrics over time. Declining performance suggests drift.

Prediction monitoring: Track prediction distribution changes. Unusual patterns suggest drift.
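
As one example of a statistical test, the sketch below compares a single numeric feature between training and production samples with a two-sample Kolmogorov-Smirnov test from scipy. The synthetic shift and the 0.01 p-value cutoff are illustrative assumptions.

  import numpy as np
  from scipy.stats import ks_2samp

  rng = np.random.default_rng(0)
  train_feature = rng.normal(loc=0.0, scale=1.0, size=5000)   # training distribution
  prod_feature = rng.normal(loc=0.3, scale=1.0, size=5000)    # drifted production data

  stat, p_value = ks_2samp(train_feature, prod_feature)
  print(f"KS statistic={stat:.3f} p-value={p_value:.4f}")
  if p_value < 0.01:                                          # assumed cutoff
      print("Distributions differ significantly - possible data drift")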

Addressing Drift

Model retraining: Update the model with recent data.

Sliding windows: Train on recent data only, discarding old data.

Ensemble methods: Combine models trained on different time periods.

Testing relevance: Test with data from different time periods. Monitor for drift in production.

Chapter 5: ML Functional Performance Metrics

Metrics quantify model quality. Choosing appropriate metrics is crucial for meaningful evaluation.

The Confusion Matrix

For classification problems, the confusion matrix summarizes predictions:

                    Predicted
                 Positive  Negative
Actual Positive    TP        FN
       Negative    FP        TN

  • True Positives (TP): Correctly predicted positive
  • True Negatives (TN): Correctly predicted negative
  • False Positives (FP): Incorrectly predicted positive (Type I error)
  • False Negatives (FN): Incorrectly predicted negative (Type II error)

All classification metrics derive from these four values.

Reading a Confusion Matrix

Example: Cancer screening model

                    Predicted
                 Cancer  No Cancer
Actual Cancer      80        20      (100 actual cancer)
       No Cancer   30       870      (900 actual no cancer)
                  (110 predicted cancer)  (890 predicted no cancer)

  • TP = 80: Correctly identified cancer
  • TN = 870: Correctly identified no cancer
  • FP = 30: Healthy people incorrectly flagged (unnecessary anxiety, testing)
  • FN = 20: Cancer missed (potentially serious consequences)

Classification Metrics

Accuracy

Formula: (TP + TN) / (TP + TN + FP + FN)

Meaning: Proportion of all predictions that are correct.

Example: (80 + 870) / 1000 = 95%

When it's useful: Balanced classes where all errors are equally costly.

When it's misleading: Imbalanced classes. A model that predicts "no cancer" for everyone would achieve 90% accuracy in our example but miss all cancers.

Precision

Formula: TP / (TP + FP)

Meaning: Of all positive predictions, how many were correct?

Example: 80 / (80 + 30) = 72.7%

When it matters: When false positives are costly. For spam filtering, low precision means legitimate emails go to spam.

Recall (Sensitivity, True Positive Rate)

Formula: TP / (TP + FN)

Meaning: Of all actual positives, how many did we catch?

Example: 80 / (80 + 20) = 80%

When it matters: When false negatives are costly. For cancer screening, low recall means missing cancer cases.

F1 Score

Formula: 2 * (Precision * Recall) / (Precision + Recall)

Meaning: Harmonic mean of precision and recall. Balances both concerns.

Example: 2 * (0.727 * 0.80) / (0.727 + 0.80) = 76.2%

When it's useful: When you need a single metric that considers both false positives and false negatives.

Specificity (True Negative Rate)

Formula: TN / (TN + FP)

Meaning: Of all actual negatives, how many did we correctly identify?

Example: 870 / (870 + 30) = 96.7%

When it matters: When correctly identifying negatives is important.

Exam Tip: Practice calculating these metrics from confusion matrices. Questions often provide a matrix and ask for specific metrics.
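
Working through the cancer-screening matrix above in plain Python makes the formulas easy to check:

  TP, FN, FP, TN = 80, 20, 30, 870

  accuracy    = (TP + TN) / (TP + TN + FP + FN)         # 0.950
  precision   = TP / (TP + FP)                          # 0.727
  recall      = TP / (TP + FN)                          # 0.800
  specificity = TN / (TN + FP)                          # 0.967
  f1 = 2 * precision * recall / (precision + recall)    # 0.762

  print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
        f"recall={recall:.3f} specificity={specificity:.3f} f1={f1:.3f}")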

The Precision-Recall Trade-off

Precision and recall trade off against each other: by adjusting the prediction threshold, you can usually increase one at the cost of the other.

Lower threshold (more positive predictions):

  • Catches more true positives (higher recall)
  • Also produces more false positives (lower precision)

Higher threshold (fewer positive predictions):

  • More confident predictions (higher precision)
  • Misses more true positives (lower recall)

Choosing the trade-off depends on context:

  • Cancer screening: Prioritize recall (don't miss cancer)
  • Spam filtering: Balance depends on user tolerance
  • Fraud detection: May prioritize precision (don't block legitimate transactions)

ROC Curve and AUC

The ROC (Receiver Operating Characteristic) curve plots true positive rate against false positive rate across different thresholds.

AUC (Area Under the Curve): Single number summarizing ROC curve performance.

  • AUC = 1.0: Perfect classification
  • AUC = 0.5: Random guessing
  • AUC < 0.5: Worse than random (model is inverted)

When it's useful: Comparing models without committing to a specific threshold.
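
The sketch below computes AUC with scikit-learn and then shows how moving the threshold shifts precision and recall; the synthetic, imbalanced data and the three thresholds are illustrative assumptions.

  from sklearn.datasets import make_classification
  from sklearn.linear_model import LogisticRegression
  from sklearn.metrics import precision_score, recall_score, roc_auc_score
  from sklearn.model_selection import train_test_split

  X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
  X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

  model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
  probs = model.predict_proba(X_test)[:, 1]     # probability of the positive class

  print("AUC:", round(roc_auc_score(y_test, probs), 3))

  for threshold in (0.3, 0.5, 0.7):             # lower threshold -> higher recall
      preds = (probs >= threshold).astype(int)
      p = precision_score(y_test, preds, zero_division=0)
      r = recall_score(y_test, preds, zero_division=0)
      print(f"threshold={threshold} precision={p:.2f} recall={r:.2f}")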

Regression Metrics

For continuous predictions, different metrics apply.

Mean Squared Error (MSE)

Formula: Average of (actual - predicted)^2

Meaning: Average squared difference between predictions and actual values.

Characteristics:

  • Penalizes large errors more than small errors
  • Units are squared (e.g., dollars^2)
  • Lower is better

Root Mean Squared Error (RMSE)

Formula: Square root of MSE

Meaning: Same as MSE but in original units.

Characteristics:

  • Easier to interpret than MSE
  • Still penalizes large errors more
  • Lower is better

Mean Absolute Error (MAE)

Formula: Average of |actual - predicted|

Meaning: Average absolute difference between predictions and actual values.

Characteristics:

  • Linear penalty for errors
  • Less sensitive to outliers than MSE/RMSE
  • Lower is better

R-Squared (Coefficient of Determination)

Formula: 1 - (sum of squared residuals / total sum of squares)

Meaning: Proportion of variance explained by the model.

Characteristics:

  • Ranges from 0 to 1 (can be negative for very poor models)
  • 1 = perfect predictions
  • 0 = model explains nothing beyond mean
  • Higher is better
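
All four regression metrics are available in scikit-learn; the hand-made house-price numbers below are purely illustrative.

  import numpy as np
  from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

  actual    = np.array([200_000, 350_000, 150_000, 425_000])   # e.g. house prices
  predicted = np.array([210_000, 330_000, 165_000, 400_000])

  mse  = mean_squared_error(actual, predicted)
  rmse = np.sqrt(mse)                    # back to original units (dollars)
  mae  = mean_absolute_error(actual, predicted)
  r2   = r2_score(actual, predicted)

  print(f"MSE={mse:,.0f} RMSE={rmse:,.0f} MAE={mae:,.0f} R2={r2:.3f}")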

Benchmarking and Baselines

Model performance is meaningful only in comparison to alternatives.

Types of Baselines

Random baseline: What would random predictions achieve?

Constant baseline: What if you always predicted the most common class or the mean value?

Simple rule baseline: What would simple business rules achieve?

Previous system baseline: How did the existing solution perform?

Human baseline: How well do human experts perform?

Why Baselines Matter

A model with 85% accuracy sounds good until you learn:

  • The constant baseline achieves 90% (highly imbalanced data)
  • Human experts achieve 98% (model is much worse)
  • Random would achieve 50% (model is significantly better)

Always contextualize metrics with relevant baselines.
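
One lightweight way to keep baselines honest is to evaluate naive models alongside the real one. A minimal sketch with scikit-learn's DummyClassifier, using deliberately imbalanced synthetic data:

  from sklearn.datasets import make_classification
  from sklearn.dummy import DummyClassifier
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import train_test_split

  X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
  X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

  candidates = [
      ("constant baseline (most frequent class)", DummyClassifier(strategy="most_frequent")),
      ("random baseline", DummyClassifier(strategy="uniform", random_state=0)),
      ("logistic regression", LogisticRegression(max_iter=1000)),
  ]
  for name, model in candidates:
      model.fit(X_train, y_train)
      print(f"{name}: accuracy={model.score(X_test, y_test):.3f}")

On this imbalanced data, the most-frequent baseline alone scores around 90% accuracy, which is exactly the trap described above.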

Establishing Meaningful Benchmarks

Statistical benchmarks: Metrics that any reasonable model should beat.

Business benchmarks: Minimum performance for business viability.

Competitive benchmarks: Performance of alternative solutions.

Improvement benchmarks: Performance of previous model versions.

Testing should verify that models exceed relevant benchmarks.



Frequently Asked Questions


Do I need to implement ML algorithms for the CT-AI exam?

How do I calculate precision, recall, and F1 from a confusion matrix?

What's the difference between overfitting and underfitting?

Why do we split data into training, validation, and test sets?

When should I use accuracy vs precision vs recall?

What is data drift and why does it matter for testing?

What data quality problems should testers look for?

How do baselines help evaluate ML models?