
ISTQB CT-AI: Methods and Techniques for AI Testing
Testing AI systems requires specialized techniques that address challenges traditional testing methods weren't designed for. This article covers CT-AI syllabus Chapter 9: Methods and Techniques for Testing AI-Based Systems.
You'll learn specific testing methods including adversarial testing, metamorphic testing, pairwise testing, and experience-based approaches adapted for AI. Each technique addresses particular AI testing challenges, and knowing when to apply each is essential for effective AI testing.
Overview of AI Testing Techniques
AI testing techniques address specific challenges that traditional approaches don't handle well:
| Challenge | Traditional Approach | AI-Specific Technique |
|---|---|---|
| Unknown expected outputs | Compare to specification | Metamorphic testing |
| Model robustness | Boundary value testing | Adversarial testing |
| Configuration complexity | Manual selection | Pairwise testing |
| Model comparison | Code comparison | Back-to-back testing |
| Large input spaces | Equivalence partitioning | Intelligent sampling |
These techniques complement rather than replace traditional testing. You'll still need unit tests, integration tests, and system tests - but you'll apply these specialized techniques within that framework.
Adversarial Testing
Adversarial testing probes AI system robustness by creating inputs designed to fool the model.
What is Adversarial Testing?
Adversarial testing involves creating adversarial examples - inputs specifically crafted to cause incorrect predictions while appearing normal or only slightly modified to humans.
Classic example: An image classifier correctly identifies a panda. A tiny, imperceptible perturbation is added to the image. The modified image looks identical to humans but the classifier now confidently predicts "gibbon."
This isn't a quirk of one model - adversarial vulnerability is a fundamental property of many ML systems, especially neural networks.
Why Adversarial Testing Matters
Security implications: Attackers can exploit adversarial vulnerabilities to:
- Evade spam filters with adversarial text
- Fool facial recognition systems
- Trick autonomous vehicle sensors
- Bypass malware detection
Reliability implications: Even without malicious intent, naturally occurring inputs may accidentally trigger adversarial behavior.
Trust implications: Systems that can be easily fooled shouldn't be trusted for important decisions.
Types of Adversarial Attacks
Evasion attacks: Modify inputs at inference time to cause misclassification.
- Perturbation attacks (small changes to inputs)
- Patch attacks (adding adversarial patches to images)
- Physical attacks (modifying real-world objects)
Poisoning attacks: Inject malicious data during training to corrupt the model.
- Label flipping (changing labels in training data)
- Data injection (adding adversarial examples to training)
- Backdoor attacks (creating trigger-activated misbehavior)
Model extraction attacks: Steal model functionality through queries.
- Query-based extraction
- Side-channel extraction
Adversarial Testing Approaches
White-box testing: You have full access to model architecture and weights.
- Gradient-based attacks (FGSM, PGD)
- Can generate highly effective adversarial examples
- Represents motivated, knowledgeable attackers
Black-box testing: You can only query the model, not see inside.
- Query-based attacks
- Transfer attacks (adversarial examples from similar models)
- Represents typical attacker scenario
Practical Adversarial Testing
1. Select attack types: Choose attacks relevant to your system and threat model.
2. Generate adversarial examples: Use tools like Foolbox, CleverHans, or ART (Adversarial Robustness Toolbox).
3. Test model response: Evaluate accuracy on adversarial inputs.
4. Evaluate defenses: If defenses are implemented, test their effectiveness.
5. Document vulnerabilities: Record what attacks succeed and their characteristics.
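To make steps 2 and 3 concrete, here is a minimal white-box sketch in PyTorch that implements FGSM directly rather than through a toolkit. It assumes `model` is a differentiable image classifier whose inputs are scaled to [0, 1]; the function names and the epsilon value are illustrative.
```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, eps=0.03):
    """Fast Gradient Sign Method: one signed-gradient step that increases the loss."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    # Perturb in the direction of the loss gradient's sign, then clip to the valid input range
    x_adv = (x_adv + eps * x_adv.grad.sign()).clamp(0.0, 1.0)
    return x_adv.detach()

def adversarial_accuracy(model, x, y, eps=0.03):
    """Compare accuracy on clean inputs with accuracy on FGSM-perturbed inputs."""
    model.eval()
    with torch.no_grad():
        clean_acc = (model(x).argmax(dim=1) == y).float().mean().item()
    x_adv = fgsm_attack(model, x, y, eps)
    with torch.no_grad():
        adv_acc = (model(x_adv).argmax(dim=1) == y).float().mean().item()
    return clean_acc, adv_acc
```
A large gap between clean and adversarial accuracy (for example, 95% dropping to 20%) is exactly the kind of vulnerability to record in step 5.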
Exam Tip: Understand the concept of adversarial testing and why it matters. You don't need to implement specific attack algorithms, but know that adversarial examples are carefully crafted inputs designed to fool models while appearing normal to humans.
Limitations of Adversarial Testing
Arms race: Defenses often create new vulnerabilities.
Not comprehensive: Passing adversarial tests doesn't prove robustness.
Attack specificity: Tests only cover attempted attack types.
Resource intensive: Generating effective adversarial examples takes computation.
Metamorphic Testing
Metamorphic testing addresses the oracle problem by testing relationships between inputs and outputs.
What is Metamorphic Testing?
Instead of checking if specific outputs are correct (which may be unknown), metamorphic testing verifies that relationships between inputs and outputs are correct.
Metamorphic relation: A relationship that should hold between test inputs and outputs.
Example - Translation system:
- Input A: "Hello, how are you?"
- Input B: "Hi, how are you?" (semantically similar)
- Metamorphic relation: Translations should be similar
- Test: If translations differ dramatically, something is wrong
We don't need to know the "correct" translation to identify that wildly different translations for similar inputs indicate a problem.
Why Metamorphic Testing Works for AI
Addresses the oracle problem: Many AI predictions lack known correct answers.
Leverages domain knowledge: Experts know relationships even when exact answers are unknown.
Scales effectively: Generate many test cases from metamorphic relations.
Finds real bugs: Violations of expected relationships often indicate problems.
Types of Metamorphic Relations
Equivalence relations: Semantically equivalent inputs should produce equivalent outputs.
- Synonym substitution in text shouldn't change sentiment
- Image rotation shouldn't change object classification
- Unit conversion shouldn't change numeric predictions
Transformation relations: Known input transformations should produce expected output transformations.
- Doubling distance should roughly double delivery time
- Increasing feature value should increase predicted price
- Negating sentiment text should reverse sentiment score
Invariance relations: Some changes shouldn't affect output.
- Background changes shouldn't affect object detection
- Irrelevant features shouldn't affect predictions
- Timestamp formatting shouldn't affect behavior
Consistency relations: Predictions should be internally consistent.
- Parts should sum to whole
- Rankings should be transitive
- Probabilities should sum to 1
Designing Metamorphic Tests
1. Identify metamorphic relations: Work with domain experts to identify relationships that should hold.
2. Generate source test cases: Create initial inputs to transform.
3. Apply transformations: Transform inputs according to metamorphic relations.
4. Execute tests: Run source and transformed inputs through the system.
5. Check relations: Verify the expected relationships hold.
Example: Testing a Sentiment Analysis System
Metamorphic relations:
- Adding positive words should not decrease positive sentiment
- Adding negative words should not increase positive sentiment
- Replacing words with synonyms should not significantly change sentiment
- Negation should reverse sentiment direction
Test generation:
- Source: "This product is good"
- Transform 1: "This product is very good" (should be same or higher)
- Transform 2: "This product is good and useful" (should be same or higher)
- Transform 3: "This product is not good" (should reverse)
Verification:
- Check that relationships hold
- Flag violations for investigation
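A minimal sketch of these relations as pytest-style tests, assuming a hypothetical `predict_sentiment(text)` function that returns a score in [0, 1], higher meaning more positive; the module name and tolerance values are illustrative.
```python
# predict_sentiment(text) -> float in [0, 1], higher = more positive (hypothetical project function)
from my_model import predict_sentiment

def test_adding_positive_word_does_not_decrease_sentiment():
    source = predict_sentiment("This product is good")
    followup = predict_sentiment("This product is very good")
    assert followup >= source - 0.05  # small tolerance for model noise

def test_synonym_substitution_does_not_change_sentiment_much():
    source = predict_sentiment("This product is good")
    followup = predict_sentiment("This product is nice")
    assert abs(followup - source) <= 0.2  # "significant change" threshold is a judgment call

def test_negation_reverses_sentiment_direction():
    source = predict_sentiment("This product is good")
    followup = predict_sentiment("This product is not good")
    assert source > 0.5 > followup  # positive source should flip to the negative side
```
A failing assertion does not tell you the correct score; it flags a violated relation that needs investigation.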
Benefits and Limitations
Benefits:
- Works without known expected outputs
- Generates many tests from few relations
- Catches meaningful bugs
- Based on domain knowledge
Limitations:
- Requires identifying valid relations
- May not catch all bugs
- Relations may be approximate
- Violations require investigation
Pairwise Testing for AI
Pairwise testing efficiently covers parameter combinations when exhaustive testing is impractical.
What is Pairwise Testing?
Pairwise testing (also called all-pairs testing) selects test cases that cover all pairs of parameter values at least once, rather than all possible combinations.
Why it works: Research shows that most defects are triggered by a single parameter value or by the interaction of two parameter values, not by complex multi-parameter interactions. Pairwise testing covers these one- and two-way interactions efficiently.
Applying Pairwise to AI
AI systems have many configurable parameters:
- Model hyperparameters (learning rate, batch size, layers)
- Data preprocessing options (normalization, augmentation)
- Feature selections (which features to include)
- Inference configurations (threshold, temperature)
Testing all combinations is impractical. Pairwise testing provides efficient coverage.
Example: Configuration Testing
Parameters:
- Learning rate: {0.001, 0.01, 0.1}
- Batch size: {16, 32, 64}
- Optimizer: {SGD, Adam, RMSprop}
- Dropout: {0.0, 0.25, 0.5}
Full combinations: 3 × 3 × 3 × 3 = 81 configurations
Pairwise: Approximately 9-12 configurations cover all pairs
This reduction enables testing configurations that would otherwise be skipped.
When to Use Pairwise for AI
Configuration testing: Testing different model configurations.
Data pipeline testing: Testing combinations of preprocessing steps.
Integration testing: Testing system behavior with different component configurations.
Environment testing: Testing across platform/library combinations.
Tools for Pairwise Testing
Many tools generate pairwise test cases:
- PICT (Pairwise Independent Combinatorial Testing)
- AllPairs
- Various online generators
These tools take parameter lists and generate minimal test sets covering all pairs.
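For instance, the parameters from the configuration example above can be fed to the open-source allpairspy generator (other tools such as PICT work similarly). This is a minimal sketch, and the exact number of generated configurations depends on the tool.
```python
from allpairspy import AllPairs  # pip install allpairspy

# Parameter domains from the configuration example above
parameters = [
    [0.001, 0.01, 0.1],          # learning rate
    [16, 32, 64],                # batch size
    ["SGD", "Adam", "RMSprop"],  # optimizer
    [0.0, 0.25, 0.5],            # dropout
]

# Each generated row is one configuration to train and evaluate;
# together the rows cover every pair of parameter values at least once.
for i, (lr, batch, opt, dropout) in enumerate(AllPairs(parameters), start=1):
    print(f"config {i}: lr={lr}, batch={batch}, optimizer={opt}, dropout={dropout}")
```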
Back-to-Back Testing
Back-to-back testing compares outputs from different implementations or versions.
What is Back-to-Back Testing?
Back-to-back testing runs the same inputs through two or more implementations and compares outputs. Differences indicate potential problems in one or both implementations.
Use cases:
- Comparing model versions (new vs old)
- Comparing implementations (different frameworks)
- Comparing model and reference (model vs expert)
- Comparing production and development environments
Why It's Valuable for AI
Catches regressions: New versions shouldn't perform worse on existing cases.
Validates migration: Moving to new frameworks or infrastructure shouldn't change behavior.
Provides pseudo-oracle: When you don't know the correct answer, agreement between implementations provides confidence.
Enables A/B comparison: Comparing models helps select the best one.
Implementing Back-to-Back Testing
1. Select inputs: Choose representative inputs covering important scenarios.
2. Run through implementations: Get predictions from each version.
3. Compare outputs: Identify differences.
4. Analyze differences: Determine if differences are acceptable or problematic.
Handling Expected Differences
Not all differences are bugs. ML models may differ due to:
- Training randomness
- Floating-point variations
- Legitimate improvements
Strategies:
- Use statistical comparison (aggregate metrics) rather than exact matching
- Set tolerance thresholds for acceptable variation
- Focus on significant changes (direction, magnitude)
- Flag large or systematic differences for review
Example: Model Version Comparison
Scenario: Comparing new model v2 with production model v1
Test set: 10,000 representative examples
Comparison metrics:
- Overall accuracy: v1 = 92%, v2 = 94% (improvement)
- Per-class accuracy: Check no class degrades significantly
- Disagreement rate: 8% of predictions differ
- Confidence comparison: Similar confidence distributions
Analysis: v2 shows overall improvement. Review the 8% disagreement cases to ensure they're improvements, not regressions.
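A minimal sketch of this comparison, assuming hypothetical arrays `preds_v1`, `preds_v2`, and `labels` for the 10,000-example test set; the per-class tolerance threshold is illustrative.
```python
import numpy as np

def back_to_back_report(preds_v1, preds_v2, labels, max_class_drop=0.02):
    """Compare two model versions on the same test set (pseudo-oracle comparison)."""
    preds_v1, preds_v2, labels = map(np.asarray, (preds_v1, preds_v2, labels))

    acc_v1 = (preds_v1 == labels).mean()
    acc_v2 = (preds_v2 == labels).mean()
    disagreement = (preds_v1 != preds_v2).mean()

    # Flag any class whose accuracy degrades by more than the tolerance threshold
    degraded = []
    for cls in np.unique(labels):
        mask = labels == cls
        drop = (preds_v1[mask] == cls).mean() - (preds_v2[mask] == cls).mean()
        if drop > max_class_drop:
            degraded.append((cls, float(drop)))

    # Indices where v1 was right and v2 is wrong: candidate regressions to review manually
    regressions = np.where((preds_v1 == labels) & (preds_v2 != labels))[0]
    return {
        "accuracy_v1": acc_v1,
        "accuracy_v2": acc_v2,
        "disagreement_rate": disagreement,
        "degraded_classes": degraded,
        "regression_indices": regressions,
    }
```
The report supports exactly the analysis above: confirm the overall improvement, then review the disagreement and regression cases by hand.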
Experience-Based Techniques
Experience-based testing uses tester knowledge and intuition, adapted for AI contexts.
Error Guessing for AI
Traditional error guessing anticipates defects based on experience. For AI systems, common error sources include:
Data-related errors:
- Unusual data formats
- Missing values
- Extreme values
- Data from underrepresented groups
Model-related errors:
- Edge cases in decision boundaries
- Inputs similar to misclassified training examples
- Adversarial patterns
Integration-related errors:
- Mismatched data formats between components
- Version incompatibilities
- Environment differences
Exploratory Testing for AI
Exploratory testing combines learning, test design, and execution simultaneously.
Adapted for AI:
- Explore prediction boundaries
- Probe model understanding
- Test model reactions to unusual inputs
- Investigate surprising predictions
Session structure:
- Charter: Focus area for exploration
- Time-box: Limited duration session
- Notes: Document findings and areas needing more testing
- Debrief: Share findings and plan follow-up
Example charter: "Explore how the sentiment model handles sarcasm and irony in product reviews."
Checklist-Based Testing for AI
Checklists ensure important items aren't forgotten:
Data quality checklist:
- Training data is representative
- Labels are accurate and consistent
- Missing values are handled appropriately
- Outliers are addressed
- Protected groups are adequately represented
Model evaluation checklist:
- Performance exceeds baseline
- Performance is consistent across subgroups
- Confidence calibration is reasonable
- Explanations are provided where required
- Edge cases are tested
Deployment checklist:
- Model performs similarly in production environment
- Monitoring is in place
- Rollback procedure exists
- Documentation is complete
Testing ML Pipelines
ML systems involve complex pipelines beyond just the model.
Pipeline Components
Data pipeline:
- Data extraction
- Data validation
- Data transformation
- Feature engineering
Training pipeline:
- Model training
- Hyperparameter tuning
- Model validation
- Model registration
Inference pipeline:
- Input processing
- Model inference
- Output post-processing
- Result delivery
Testing Pipeline Components
Unit tests for pipeline steps:
- Data transformations produce correct outputs
- Feature engineering logic is correct
- Preprocessing handles edge cases
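A minimal pytest-style sketch of one such unit test, using a hypothetical `normalize_features` step as the transformation under test.
```python
import numpy as np

# Hypothetical pipeline step: scale each column to zero mean and unit variance.
def normalize_features(x: np.ndarray) -> np.ndarray:
    std = x.std(axis=0)
    std[std == 0] = 1.0  # guard against constant columns (an easy-to-miss edge case)
    return (x - x.mean(axis=0)) / std

def test_normalize_features_basic():
    x = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
    out = normalize_features(x)
    assert np.allclose(out.mean(axis=0), 0.0)
    assert np.allclose(out.std(axis=0), 1.0)

def test_normalize_features_constant_column_edge_case():
    x = np.array([[2.0, 1.0], [2.0, 2.0], [2.0, 3.0]])
    out = normalize_features(x)
    assert not np.isnan(out).any()  # no division-by-zero artifacts
```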
Integration tests for pipeline flow:
- Data formats are consistent between steps
- Step outputs match next step inputs
- Error handling works correctly
Data validation tests:
- Schema validation (correct types, required fields)
- Range validation (values within expected ranges)
- Statistical validation (distributions match expectations)
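A minimal sketch of these three validation layers using pandas, with hypothetical column names, bounds, and drift thresholds.
```python
import pandas as pd

def validate_batch(df: pd.DataFrame, train_stats: dict) -> list[str]:
    """Return a list of validation failures for an incoming data batch."""
    failures = []

    # Schema validation: required columns and types
    required = {"age": "int64", "income": "float64", "country": "object"}
    for col, dtype in required.items():
        if col not in df.columns:
            failures.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            failures.append(f"{col}: expected {dtype}, got {df[col].dtype}")

    # Range validation: values within expected bounds
    if "age" in df.columns and not df["age"].between(0, 120).all():
        failures.append("age: values outside [0, 120]")

    # Statistical validation: mean has not drifted too far from the training distribution
    if "income" in df.columns:
        drift = abs(df["income"].mean() - train_stats["income_mean"])
        if drift > 3 * train_stats["income_std"]:
            failures.append("income: mean drifted more than 3 standard deviations")

    return failures
```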
Pipeline-Specific Concerns
Training-serving skew: Features computed during training must be computed identically during inference.
Version compatibility: Pipeline components must be compatible with each other.
Reproducibility: Pipeline runs should be reproducible given the same inputs.
Idempotency: Running pipeline multiple times should produce consistent results.
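A minimal sketch of a training-serving skew check, assuming hypothetical `compute_features` implementations in the training and serving code paths and a small set of illustrative raw records.
```python
import numpy as np

# Hypothetical feature functions: the same logic implemented twice,
# once in the training pipeline and once in the serving code path.
from training_pipeline import compute_features as training_features  # assumed module
from serving_service import compute_features as serving_features     # assumed module

# A handful of representative raw records (illustrative)
SAMPLE_RECORDS = [
    {"age": 34, "income": 52000.0, "country": "DE"},
    {"age": 61, "income": 87000.0, "country": "US"},
]

def test_no_training_serving_skew():
    """Both code paths must produce (near-)identical features for the same raw records."""
    train_out = np.asarray([training_features(r) for r in SAMPLE_RECORDS])
    serve_out = np.asarray([serving_features(r) for r in SAMPLE_RECORDS])
    # Allow only tiny floating-point differences; anything larger is training-serving skew.
    assert np.allclose(train_out, serve_out, atol=1e-6)
```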
A/B Testing for AI
A/B testing compares model performance in production with real users.
What is A/B Testing for AI?
Split traffic between two model versions and compare business metrics:
- Version A: Current production model (control)
- Version B: New model candidate (treatment)
Measure:
- Business metrics (conversion, engagement, revenue)
- ML metrics (accuracy, latency)
- User experience metrics (satisfaction, complaints)
When to Use A/B Testing
Model selection: Choose between competing models based on real-world performance.
Feature evaluation: Test if new features improve outcomes.
Threshold tuning: Find optimal decision thresholds for business goals.
Risk mitigation: Validate changes before full rollout.
A/B Testing Considerations
Sample size: Need enough traffic for statistical significance.
Duration: Run long enough to capture temporal patterns.
Segmentation: Ensure random, comparable groups.
Metrics: Define success metrics before testing.
Statistical rigor: Use appropriate statistical tests.
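As a sketch of the statistical-rigor point, a chi-square test on conversion counts from the two arms is one appropriate choice; the counts below are illustrative.
```python
from scipy.stats import chi2_contingency

# Illustrative results: conversions vs non-conversions for each arm (20,000 users per arm)
control = {"converted": 1180, "not_converted": 18820}    # version A (production)
treatment = {"converted": 1265, "not_converted": 18735}  # version B (candidate)

table = [
    [control["converted"], control["not_converted"]],
    [treatment["converted"], treatment["not_converted"]],
]

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"conversion A: {control['converted'] / 20000:.2%}")
print(f"conversion B: {treatment['converted'] / 20000:.2%}")
print(f"p-value: {p_value:.4f}")
# Only roll out B if the difference is statistically significant (e.g. p < 0.05)
# and the lift is large enough to matter for the business metric.
```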
Exam Tip: Understand when each technique applies. Adversarial testing probes robustness, metamorphic testing addresses the oracle problem, pairwise testing covers configurations efficiently, and back-to-back testing compares implementations.
Combining Techniques
Effective AI testing combines multiple techniques:
Layered approach:
- Unit tests for pipeline components (traditional testing)
- Metamorphic tests for model behavior
- Adversarial tests for robustness
- Back-to-back tests for version comparison
- A/B tests for production validation
Risk-based selection:
- High-risk systems: More comprehensive testing
- Security-sensitive: Emphasis on adversarial testing
- Fairness-critical: Emphasis on subgroup testing
- User-facing: Emphasis on experience testing
Continuous testing:
- Automated tests in CI/CD
- Monitoring as testing in production
- Periodic comprehensive evaluations
Frequently Asked Questions
When should I use adversarial testing vs metamorphic testing?
How do I identify good metamorphic relations?
Do I need to implement adversarial attacks for the CT-AI exam?
When is pairwise testing useful for AI systems?
How do I handle differences in back-to-back testing when some variation is expected?
What should I test in ML pipelines beyond the model itself?
How do I decide which AI testing techniques to use?
What's the difference between exploratory testing for traditional software vs AI?