
ISTQB CT-AI: Methods and Techniques for AI Testing
Testing AI systems requires specialized techniques that address challenges traditional testing methods weren't designed for. This article covers CT-AI syllabus Chapter 9: Methods and Techniques for Testing AI-Based Systems.
You'll learn specific testing methods including adversarial testing, metamorphic testing, pairwise testing, and experience-based approaches adapted for AI. Each technique addresses particular AI testing challenges, and knowing when to apply each is essential for effective AI testing.
Overview of AI Testing Techniques
AI testing techniques address specific challenges that traditional approaches don't handle well:
| Challenge | Traditional Approach | AI-Specific Technique |
|---|---|---|
| Unknown expected outputs | Compare to specification | Metamorphic testing |
| Model robustness | Boundary value testing | Adversarial testing |
| Configuration complexity | Manual selection | Pairwise testing |
| Model comparison | Code comparison | Back-to-back testing |
| Large input spaces | Equivalence partitioning | Intelligent sampling |
These techniques complement rather than replace traditional testing. You'll still need unit tests, integration tests, and system tests - but you'll apply these specialized techniques within that framework.
Adversarial Testing
Adversarial testing probes AI system robustness by creating inputs designed to fool the model.
What is Adversarial Testing?
Adversarial testing involves creating adversarial examples - inputs specifically crafted to cause incorrect predictions while appearing normal or only slightly modified to humans.
Classic example: An image classifier correctly identifies a panda. A tiny, imperceptible perturbation is added to the image. The modified image looks identical to humans but the classifier now confidently predicts "gibbon."
This isn't a quirk of one model - adversarial vulnerability is a fundamental property of many ML systems, especially neural networks.
Why Adversarial Testing Matters
Security implications: Attackers can exploit adversarial vulnerabilities to:
- Evade spam filters with adversarial text
- Fool facial recognition systems
- Trick autonomous vehicle sensors
- Bypass malware detection
Reliability implications: Even without malicious intent, naturally occurring inputs may accidentally trigger adversarial behavior.
Trust implications: Systems that can be easily fooled shouldn't be trusted for important decisions.
Types of Adversarial Attacks
Evasion attacks: Modify inputs at inference time to cause misclassification.
- Perturbation attacks (small changes to inputs)
- Patch attacks (adding adversarial patches to images)
- Physical attacks (modifying real-world objects)
Poisoning attacks: Inject malicious data during training to corrupt the model.
- Label flipping (changing labels in training data)
- Data injection (adding adversarial examples to training)
- Backdoor attacks (creating trigger-activated misbehavior)
Model extraction attacks: Steal model functionality through queries.
- Query-based extraction
- Side-channel extraction
Adversarial Testing Approaches
White-box testing: You have full access to model architecture and weights.
- Gradient-based attacks (FGSM, PGD)
- Can generate highly effective adversarial examples
- Represents motivated, knowledgeable attackers
Black-box testing: You can only query the model, not see inside.
- Query-based attacks
- Transfer attacks (adversarial examples from similar models)
- Represents typical attacker scenario
Practical Adversarial Testing
1. Select attack types: Choose attacks relevant to your system and threat model.
2. Generate adversarial examples: Use tools like Foolbox, CleverHans, or ART (Adversarial Robustness Toolbox).
3. Test model response: Evaluate accuracy on adversarial inputs.
4. Evaluate defenses: If defenses are implemented, test their effectiveness.
5. Document vulnerabilities: Record what attacks succeed and their characteristics.
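To make steps 2 and 3 concrete, here is a minimal white-box sketch in PyTorch that implements FGSM directly rather than through a toolkit. It assumes `model` is a differentiable image classifier whose inputs are scaled to [0, 1]; the function names and the epsilon value are illustrative.
```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, eps=0.03):
    """Fast Gradient Sign Method: one signed-gradient step that increases the loss."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    # Perturb in the direction of the loss gradient's sign, then clip to the valid input range
    x_adv = (x_adv + eps * x_adv.grad.sign()).clamp(0.0, 1.0)
    return x_adv.detach()

def adversarial_accuracy(model, x, y, eps=0.03):
    """Compare accuracy on clean inputs with accuracy on FGSM-perturbed inputs."""
    model.eval()
    with torch.no_grad():
        clean_acc = (model(x).argmax(dim=1) == y).float().mean().item()
    x_adv = fgsm_attack(model, x, y, eps)
    with torch.no_grad():
        adv_acc = (model(x_adv).argmax(dim=1) == y).float().mean().item()
    return clean_acc, adv_acc
```
A large gap between clean and adversarial accuracy (for example, 95% dropping to 20%) is exactly the kind of vulnerability to record in step 5.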
Exam Tip: Understand the concept of adversarial testing and why it matters. You don't need to implement specific attack algorithms, but know that adversarial examples are carefully crafted inputs designed to fool models while appearing normal to humans.
Limitations of Adversarial Testing
Arms race: Defenses often create new vulnerabilities.
Not comprehensive: Passing adversarial tests doesn't prove robustness.
Attack specificity: Tests only cover attempted attack types.
Resource intensive: Generating effective adversarial examples takes computation.
Metamorphic Testing
Metamorphic testing addresses the oracle problem by testing relationships between inputs and outputs.
What is Metamorphic Testing?
Instead of checking if specific outputs are correct (which may be unknown), metamorphic testing verifies that relationships between inputs and outputs are correct.
Metamorphic relation: A relationship that should hold between test inputs and outputs.
Example - Translation system:
- Input A: "Hello, how are you?"
- Input B: "Hi, how are you?" (semantically similar)
- Metamorphic relation: Translations should be similar
- Test: If translations differ dramatically, something is wrong
We don't need to know the "correct" translation to identify that wildly different translations for similar inputs indicate a problem.
Why Metamorphic Testing Works for AI
Addresses the oracle problem: Many AI predictions lack known correct answers.
Leverages domain knowledge: Experts know relationships even when exact answers are unknown.
Scales effectively: Generate many test cases from metamorphic relations.
Finds real bugs: Violations of expected relationships often indicate problems.
Types of Metamorphic Relations
Equivalence relations: Semantically equivalent inputs should produce equivalent outputs.
- Synonym substitution in text shouldn't change sentiment
- Image rotation shouldn't change object classification
- Unit conversion shouldn't change numeric predictions
Transformation relations: Known input transformations should produce expected output transformations.
- Doubling distance should roughly double delivery time
- Increasing feature value should increase predicted price
- Negating sentiment text should reverse sentiment score
Invariance relations: Some changes shouldn't affect output.
- Background changes shouldn't affect object detection
- Irrelevant features shouldn't affect predictions
- Timestamp formatting shouldn't affect behavior
Consistency relations: Predictions should be internally consistent.
- Parts should sum to whole
- Rankings should be transitive
- Probabilities should sum to 1
Designing Metamorphic Tests
1. Identify metamorphic relations: Work with domain experts to identify relationships that should hold.
2. Generate source test cases: Create initial inputs to transform.
3. Apply transformations: Transform inputs according to metamorphic relations.
4. Execute tests: Run source and transformed inputs through the system.
5. Check relations: Verify the expected relationships hold.
Example: Testing a Sentiment Analysis System
Metamorphic relations:
- Adding positive words should not decrease positive sentiment
- Adding negative words should not increase positive sentiment
- Replacing words with synonyms should not significantly change sentiment
- Negation should reverse sentiment direction
Test generation:
- Source: "This product is good"
- Transform 1: "This product is very good" (should be same or higher)
- Transform 2: "This product is good and useful" (should be same or higher)
- Transform 3: "This product is not good" (should reverse)
Verification:
- Check that relationships hold
- Flag violations for investigation
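A minimal sketch of these relations as pytest-style tests, assuming a hypothetical `predict_sentiment(text)` function that returns a score in [0, 1], higher meaning more positive; the module name and tolerance values are illustrative.
```python
# predict_sentiment(text) -> float in [0, 1], higher = more positive (hypothetical project function)
from my_model import predict_sentiment

def test_adding_positive_word_does_not_decrease_sentiment():
    source = predict_sentiment("This product is good")
    followup = predict_sentiment("This product is very good")
    assert followup >= source - 0.05  # small tolerance for model noise

def test_synonym_substitution_does_not_change_sentiment_much():
    source = predict_sentiment("This product is good")
    followup = predict_sentiment("This product is nice")
    assert abs(followup - source) <= 0.2  # "significant change" threshold is a judgment call

def test_negation_reverses_sentiment_direction():
    source = predict_sentiment("This product is good")
    followup = predict_sentiment("This product is not good")
    assert source > 0.5 > followup  # positive source should flip to the negative side
```
A failing assertion does not tell you the correct score; it flags a violated relation that needs investigation.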
Benefits and Limitations
Benefits:
- Works without known expected outputs
- Generates many tests from few relations
- Catches meaningful bugs
- Based on domain knowledge
Limitations:
- Requires identifying valid relations
- May not catch all bugs
- Relations may be approximate
- Violations require investigation
Pairwise Testing for AI
Pairwise testing efficiently covers parameter combinations when exhaustive testing is impractical.
What is Pairwise Testing?
Pairwise testing (also called all-pairs testing) selects test cases that cover all pairs of parameter values at least once, rather than all possible combinations.
Why it works: Research shows that most defects are triggered by a single parameter value or by the interaction of two parameter values, not by complex multi-parameter interactions. Pairwise testing covers these one- and two-way interactions efficiently.
Applying Pairwise to AI
AI systems have many configurable parameters:
- Model hyperparameters (learning rate, batch size, layers)
- Data preprocessing options (normalization, augmentation)
- Feature selections (which features to include)
- Inference configurations (threshold, temperature)
Testing all combinations is impractical. Pairwise testing provides efficient coverage.
Example: Configuration Testing
Parameters:
- Learning rate: {0.001, 0.01, 0.1}
- Batch size: {16, 32, 64}
- Optimizer: {SGD, Adam, RMSprop}
- Dropout: {0.0, 0.25, 0.5}
Full combinations: 3 × 3 × 3 × 3 = 81 configurations
Pairwise: Approximately 9-12 configurations cover all pairs
This reduction enables testing configurations that would otherwise be skipped.
When to Use Pairwise for AI
Configuration testing: Testing different model configurations.
Data pipeline testing: Testing combinations of preprocessing steps.
Integration testing: Testing system behavior with different component configurations.
Environment testing: Testing across platform/library combinations.
Tools for Pairwise Testing
Many tools generate pairwise test cases:
- PICT (Pairwise Independent Combinatorial Testing)
- AllPairs
- Various online generators
These tools take parameter lists and generate minimal test sets covering all pairs.
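For instance, the parameters from the configuration example above can be fed to the open-source allpairspy generator (other tools such as PICT work similarly). This is a minimal sketch, and the exact number of generated configurations depends on the tool.
```python
from allpairspy import AllPairs  # pip install allpairspy

# Parameter domains from the configuration example above
parameters = [
    [0.001, 0.01, 0.1],          # learning rate
    [16, 32, 64],                # batch size
    ["SGD", "Adam", "RMSprop"],  # optimizer
    [0.0, 0.25, 0.5],            # dropout
]

# Each generated row is one configuration to train and evaluate;
# together the rows cover every pair of parameter values at least once.
for i, (lr, batch, opt, dropout) in enumerate(AllPairs(parameters), start=1):
    print(f"config {i}: lr={lr}, batch={batch}, optimizer={opt}, dropout={dropout}")
```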
Back-to-Back Testing
Back-to-back testing compares outputs from different implementations or versions.
What is Back-to-Back Testing?
Back-to-back testing runs the same inputs through two or more implementations and compares outputs. Differences indicate potential problems in one or both implementations.
Use cases:
- Comparing model versions (new vs old)
- Comparing implementations (different frameworks)
- Comparing model and reference (model vs expert)
- Comparing production and development environments
Why It's Valuable for AI
Catches regressions: New versions shouldn't perform worse on existing cases.
Validates migration: Moving to new frameworks or infrastructure shouldn't change behavior.
Provides pseudo-oracle: When you don't know the correct answer, agreement between implementations provides confidence.
Enables A/B comparison: Comparing models helps select the best one.
Implementing Back-to-Back Testing
1. Select inputs: Choose representative inputs covering important scenarios.
2. Run through implementations: Get predictions from each version.
3. Compare outputs: Identify differences.
4. Analyze differences: Determine if differences are acceptable or problematic.
Handling Expected Differences
Not all differences are bugs. ML models may differ due to:
- Training randomness
- Floating-point variations
- Legitimate improvements
Strategies:
- Use statistical comparison (aggregate metrics) rather than exact matching
- Set tolerance thresholds for acceptable variation
- Focus on significant changes (direction, magnitude)
- Flag large or systematic differences for review
Example: Model Version Comparison
Scenario: Comparing new model v2 with production model v1
Test set: 10,000 representative examples
Comparison metrics:
- Overall accuracy: v1 = 92%, v2 = 94% (improvement)
- Per-class accuracy: Check no class degrades significantly
- Disagreement rate: 8% of predictions differ
- Confidence comparison: Similar confidence distributions
Analysis: v2 shows overall improvement. Review the 8% disagreement cases to ensure they're improvements, not regressions.
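A minimal sketch of this comparison, assuming hypothetical arrays `preds_v1`, `preds_v2`, and `labels` for the 10,000-example test set; the per-class tolerance threshold is illustrative.
```python
import numpy as np

def back_to_back_report(preds_v1, preds_v2, labels, max_class_drop=0.02):
    """Compare two model versions on the same test set (pseudo-oracle comparison)."""
    preds_v1, preds_v2, labels = map(np.asarray, (preds_v1, preds_v2, labels))

    acc_v1 = (preds_v1 == labels).mean()
    acc_v2 = (preds_v2 == labels).mean()
    disagreement = (preds_v1 != preds_v2).mean()

    # Flag any class whose accuracy degrades by more than the tolerance threshold
    degraded = []
    for cls in np.unique(labels):
        mask = labels == cls
        drop = (preds_v1[mask] == cls).mean() - (preds_v2[mask] == cls).mean()
        if drop > max_class_drop:
            degraded.append((cls, float(drop)))

    # Indices where v1 was right and v2 is wrong: candidate regressions to review manually
    regressions = np.where((preds_v1 == labels) & (preds_v2 != labels))[0]
    return {
        "accuracy_v1": acc_v1,
        "accuracy_v2": acc_v2,
        "disagreement_rate": disagreement,
        "degraded_classes": degraded,
        "regression_indices": regressions,
    }
```
The report supports exactly the analysis above: confirm the overall improvement, then review the disagreement and regression cases by hand.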
Experience-Based Techniques
Experience-based testing uses tester knowledge and intuition, adapted for AI contexts.
Error Guessing for AI
Traditional error guessing anticipates defects based on experience. For AI systems, common error sources include:
Data-related errors:
- Unusual data formats
- Missing values
- Extreme values
- Data from underrepresented groups
Model-related errors:
- Edge cases in decision boundaries
- Inputs similar to misclassified training examples
- Adversarial patterns
Integration-related errors:
- Mismatched data formats between components
- Version incompatibilities
- Environment differences
Exploratory Testing for AI
Exploratory testing combines learning, test design, and execution simultaneously.
Adapted for AI:
- Explore prediction boundaries
- Probe model understanding
- Test model reactions to unusual inputs
- Investigate surprising predictions
Session structure:
- Charter: Focus area for exploration
- Time-box: Limited duration session
- Notes: Document findings and areas needing more testing
- Debrief: Share findings and plan follow-up
Example charter: "Explore how the sentiment model handles sarcasm and irony in product reviews."
Checklist-Based Testing for AI
Checklists ensure important items aren't forgotten:
Data quality checklist:
- Training data is representative
- Labels are accurate and consistent
- Missing values are handled appropriately
- Outliers are addressed
- Protected groups are adequately represented
Model evaluation checklist:
- Performance exceeds baseline
- Performance is consistent across subgroups
- Confidence calibration is reasonable
- Explanations are provided where required
- Edge cases are tested
Deployment checklist:
- Model performs similarly in production environment
- Monitoring is in place
- Rollback procedure exists
- Documentation is complete
Testing ML Pipelines
ML systems involve complex pipelines beyond just the model.
Pipeline Components
Data pipeline:
- Data extraction
- Data validation
- Data transformation
- Feature engineering
Training pipeline:
- Model training
- Hyperparameter tuning
- Model validation
- Model registration
Inference pipeline:
- Input processing
- Model inference
- Output post-processing
- Result delivery
Testing Pipeline Components
Unit tests for pipeline steps:
- Data transformations produce correct outputs
- Feature engineering logic is correct
- Preprocessing handles edge cases
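A minimal pytest-style sketch of one such unit test, using a hypothetical `normalize_features` step as the transformation under test.
```python
import numpy as np

# Hypothetical pipeline step: scale each column to zero mean and unit variance.
def normalize_features(x: np.ndarray) -> np.ndarray:
    std = x.std(axis=0)
    std[std == 0] = 1.0  # guard against constant columns (an easy-to-miss edge case)
    return (x - x.mean(axis=0)) / std

def test_normalize_features_basic():
    x = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
    out = normalize_features(x)
    assert np.allclose(out.mean(axis=0), 0.0)
    assert np.allclose(out.std(axis=0), 1.0)

def test_normalize_features_constant_column_edge_case():
    x = np.array([[2.0, 1.0], [2.0, 2.0], [2.0, 3.0]])
    out = normalize_features(x)
    assert not np.isnan(out).any()  # no division-by-zero artifacts
```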
Integration tests for pipeline flow:
- Data formats are consistent between steps
- Step outputs match next step inputs
- Error handling works correctly
Data validation tests:
- Schema validation (correct types, required fields)
- Range validation (values within expected ranges)
- Statistical validation (distributions match expectations)
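A minimal sketch of these three validation layers using pandas, with hypothetical column names, bounds, and drift thresholds.
```python
import pandas as pd

def validate_batch(df: pd.DataFrame, train_stats: dict) -> list[str]:
    """Return a list of validation failures for an incoming data batch."""
    failures = []

    # Schema validation: required columns and types
    required = {"age": "int64", "income": "float64", "country": "object"}
    for col, dtype in required.items():
        if col not in df.columns:
            failures.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            failures.append(f"{col}: expected {dtype}, got {df[col].dtype}")

    # Range validation: values within expected bounds
    if "age" in df.columns and not df["age"].between(0, 120).all():
        failures.append("age: values outside [0, 120]")

    # Statistical validation: mean has not drifted too far from the training distribution
    if "income" in df.columns:
        drift = abs(df["income"].mean() - train_stats["income_mean"])
        if drift > 3 * train_stats["income_std"]:
            failures.append("income: mean drifted more than 3 standard deviations")

    return failures
```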
Pipeline-Specific Concerns
Training-serving skew: Features computed during training must be computed identically during inference.
Version compatibility: Pipeline components must be compatible with each other.
Reproducibility: Pipeline runs should be reproducible given the same inputs.
Idempotency: Running pipeline multiple times should produce consistent results.
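A minimal sketch of a training-serving skew check, assuming hypothetical `compute_features` implementations in the training and serving code paths and a small set of illustrative raw records.
```python
import numpy as np

# Hypothetical feature functions: the same logic implemented twice,
# once in the training pipeline and once in the serving code path.
from training_pipeline import compute_features as training_features  # assumed module
from serving_service import compute_features as serving_features     # assumed module

# A handful of representative raw records (illustrative)
SAMPLE_RECORDS = [
    {"age": 34, "income": 52000.0, "country": "DE"},
    {"age": 61, "income": 87000.0, "country": "US"},
]

def test_no_training_serving_skew():
    """Both code paths must produce (near-)identical features for the same raw records."""
    train_out = np.asarray([training_features(r) for r in SAMPLE_RECORDS])
    serve_out = np.asarray([serving_features(r) for r in SAMPLE_RECORDS])
    # Allow only tiny floating-point differences; anything larger is training-serving skew.
    assert np.allclose(train_out, serve_out, atol=1e-6)
```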
A/B Testing for AI
A/B testing compares model performance in production with real users.
What is A/B Testing for AI?
Split traffic between two model versions and compare business metrics:
- Version A: Current production model (control)
- Version B: New model candidate (treatment)
Measure:
- Business metrics (conversion, engagement, revenue)
- ML metrics (accuracy, latency)
- User experience metrics (satisfaction, complaints)
When to Use A/B Testing
Model selection: Choose between competing models based on real-world performance.
Feature evaluation: Test if new features improve outcomes.
Threshold tuning: Find optimal decision thresholds for business goals.
Risk mitigation: Validate changes before full rollout.
A/B Testing Considerations
Sample size: Need enough traffic for statistical significance.
Duration: Run long enough to capture temporal patterns.
Segmentation: Ensure random, comparable groups.
Metrics: Define success metrics before testing.
Statistical rigor: Use appropriate statistical tests.
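As a sketch of the statistical-rigor point, a chi-square test on conversion counts from the two arms is one appropriate choice; the counts below are illustrative.
```python
from scipy.stats import chi2_contingency

# Illustrative results: conversions vs non-conversions for each arm (20,000 users per arm)
control = {"converted": 1180, "not_converted": 18820}    # version A (production)
treatment = {"converted": 1265, "not_converted": 18735}  # version B (candidate)

table = [
    [control["converted"], control["not_converted"]],
    [treatment["converted"], treatment["not_converted"]],
]

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"conversion A: {control['converted'] / 20000:.2%}")
print(f"conversion B: {treatment['converted'] / 20000:.2%}")
print(f"p-value: {p_value:.4f}")
# Only roll out B if the difference is statistically significant (e.g. p < 0.05)
# and the lift is large enough to matter for the business metric.
```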
Exam Tip: Understand when each technique applies. Adversarial testing probes robustness, metamorphic testing addresses the oracle problem, pairwise testing covers configurations efficiently, and back-to-back testing compares implementations.
Combining Techniques
Effective AI testing combines multiple techniques:
Layered approach:
- Unit tests for pipeline components (traditional testing)
- Metamorphic tests for model behavior
- Adversarial tests for robustness
- Back-to-back tests for version comparison
- A/B tests for production validation
Risk-based selection:
- High-risk systems: More comprehensive testing
- Security-sensitive: Emphasis on adversarial testing
- Fairness-critical: Emphasis on subgroup testing
- User-facing: Emphasis on experience testing
Continuous testing:
- Automated tests in CI/CD
- Monitoring as testing in production
- Periodic comprehensive evaluations
Frequently Asked Questions
When should I use adversarial testing vs metamorphic testing?
How do I identify good metamorphic relations?
Do I need to implement adversarial attacks for the CT-AI exam?
When is pairwise testing useful for AI systems?
How do I handle differences in back-to-back testing when some variation is expected?
What should I test in ML pipelines beyond the model itself?
How do I decide which AI testing techniques to use?
What's the difference between exploratory testing for traditional software vs AI?