
ISTQB CT-AI: Quality Characteristics for AI-Based Systems
AI systems require evaluation against quality characteristics that traditional software testing barely addresses. While functional correctness remains important, AI systems introduce concerns around fairness, explainability, and robustness that demand specific testing approaches.
This article provides a deep dive into quality characteristics for AI-based systems, expanding on the overview from CT-AI Chapter 2 with practical testing guidance. You'll learn what each characteristic means, why it matters, how to test for it, and the trade-offs involved.
Why AI Needs Different Quality Characteristics
Traditional software quality focuses on whether the system does what it's supposed to do. You specify behavior, implement it, and test that implementation matches specification. The relationship between input and output is deterministic and predictable.
AI systems work differently. Their behavior emerges from training data rather than explicit programming. This creates quality concerns that traditional testing doesn't address:
Behavior isn't fully specified: You can't write requirements that capture everything an ML model might do. The training data implicitly defines behavior, including behaviors you didn't anticipate.
Decisions affect people: AI systems increasingly make decisions about people - who gets loans, jobs, medical treatments, or bail. These decisions require evaluation against ethical standards, not just functional requirements.
Users can't verify correctness: When an AI recommends a diagnosis or predicts credit risk, users can't easily verify if it's correct. They need other ways to calibrate trust.
Errors have patterns: AI systems don't fail randomly. They fail systematically based on their training data and architecture. Understanding failure patterns requires different testing approaches.
These differences make AI-specific quality characteristics essential for responsible AI deployment.
Explainability and Interpretability
Explainability addresses a fundamental question: Why did the system make this decision?
What Explainability Means
Explainability is the ability to provide human-understandable explanations for AI decisions. An explainable system can tell you why it made a specific prediction or recommendation.
Interpretability is closely related but refers to the ability to understand how a model works internally. An interpretable model's decision-making process is transparent by design.
The distinction matters:
- A decision tree is interpretable because you can follow the decision path
- A deep neural network may be explainable through techniques like SHAP or LIME, but it's not inherently interpretable
Levels of Explainability
Global explainability describes overall model behavior:
- Which features are most important across all predictions?
- What patterns does the model rely on?
- How does the model generally behave?
Global explanations help developers and auditors understand the model as a whole.
Local explainability explains individual predictions:
- Why was this specific loan application denied?
- What factors contributed to this fraud score?
- Why did the system recommend this treatment?
Local explanations help users and affected individuals understand specific decisions.
Why Explainability Matters
Regulatory compliance: GDPR gives individuals rights around automated decision-making that are widely interpreted as a right to explanation. The EU AI Act requires transparency for high-risk systems. Many industry regulations mandate explainability.
User trust: Users appropriately calibrate trust when they understand system reasoning. Unexplained AI recommendations may be blindly accepted (over-reliance) or ignored (under-reliance).
Error detection: Explanations help identify when models rely on spurious correlations. A model might achieve high accuracy by learning shortcuts that don't generalize.
Debugging: When models fail, explanations help diagnose why. Understanding which features drove a wrong prediction guides improvement efforts.
Testing Explainability
Testing explainability involves multiple dimensions:
Existence: Are explanations provided when required?
- Test that explanation features are available
- Verify explanations are generated for all relevant decisions
- Confirm explanations are accessible to intended users
Understandability: Can target audiences understand explanations?
- Test with representative users, not just developers
- Assess whether technical jargon is avoided
- Verify explanations match user mental models
Accuracy: Do explanations reflect actual model reasoning?
- Compare explanations to known feature importance
- Test that changing highlighted factors changes predictions
- Verify explanations aren't misleading
Consistency: Are similar predictions explained similarly?
- Test that comparable inputs receive comparable explanations
- Verify explanations are stable across model versions
- Assess explanation consistency across demographic groups
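The accuracy dimension above can be probed programmatically: if an explanation claims a feature is important, perturbing that feature should change predictions more than perturbing an unimportant one. Below is a minimal, model-agnostic sketch using scikit-learn's permutation importance as the explanation source; the dataset and model are synthetic placeholders, not a prescribed method.

```python
# Sketch: verify that "important" features actually drive predictions
# (the explanation-accuracy check). Synthetic data; model is illustrative.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 4))
y = (X[:, 0] + 0.3 * X[:, 1] > 0).astype(int)   # feature 0 dominates by design
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Rank features by permutation importance (a global explanation).
imp = permutation_importance(model, X, y, n_repeats=10, random_state=0)
ranking = np.argsort(imp.importances_mean)[::-1]
top, bottom = ranking[0], ranking[-1]

def prediction_shift(feature: int) -> float:
    """Mean change in predicted probability when one feature is shuffled."""
    X_perturbed = X.copy()
    X_perturbed[:, feature] = rng.permutation(X_perturbed[:, feature])
    return float(np.mean(np.abs(
        model.predict_proba(X)[:, 1] - model.predict_proba(X_perturbed)[:, 1]
    )))

# The feature the explanation highlights should matter more than one it doesn't.
assert prediction_shift(top) > prediction_shift(bottom), \
    "Explanation ranking does not match observed model behaviour"
```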
Exam Tip: Remember the distinction between global and local explainability. Questions often present scenarios requiring you to identify which level is relevant.
Explainability Techniques
While you don't need to implement these techniques, understanding them helps you evaluate explainability:
LIME (Local Interpretable Model-agnostic Explanations): Fits a simple surrogate model around an individual prediction by sampling inputs near the instance being explained, approximating the model's local behavior.
SHAP (SHapley Additive exPlanations): Uses game theory to assign importance values to each feature for a prediction. Provides both global and local explanations.
Attention visualization: For models using attention mechanisms, visualizing attention weights shows what the model focused on.
Counterfactual explanations: Describe what would need to change for a different outcome. "The loan would have been approved if income were $5,000 higher."
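As an illustration of the counterfactual idea, the sketch below searches for the smallest income increase that flips a rejected application to approved. The loan model, features, and step size are illustrative assumptions, not a standard algorithm.

```python
# Sketch: generate a counterfactual explanation for a hypothetical loan model
# by searching for the smallest income increase that flips the decision.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
# Synthetic applicants: [income_in_thousands, existing_debt_in_thousands]
X = rng.uniform([20, 0], [150, 80], size=(2000, 2))
y = (X[:, 0] - 0.8 * X[:, 1] > 40).astype(int)        # toy approval rule
model = LogisticRegression().fit(X, y)

applicant = np.array([[45.0, 30.0]])                  # currently rejected
assert model.predict(applicant)[0] == 0

step = 1.0                                            # search in $1k increments
counterfactual = applicant.copy()
while model.predict(counterfactual)[0] == 0 and counterfactual[0, 0] < 300:
    counterfactual[0, 0] += step

increase = counterfactual[0, 0] - applicant[0, 0]
print(f"Decision flips if income were ${increase * 1000:,.0f} higher")
```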
Fairness in AI Systems
Fairness ensures AI systems treat different groups equitably, avoiding unjust discrimination.
Defining Fairness
Fairness seems straightforward until you try to define it precisely. Multiple definitions exist, and they often conflict:
Individual fairness: Similar individuals should receive similar treatment. If two people differ only in protected attributes (race, gender, age), they should receive the same prediction.
Group fairness (demographic parity): Different groups should receive positive outcomes at equal rates. If 20% of men receive loans, 20% of women should too.
Equalized odds: True positive rates and false positive rates should be equal across groups. The model should be equally accurate (and equally wrong) for everyone.
Predictive parity: Precision should be equal across groups. When the model predicts positive, it should be equally likely to be correct regardless of group.
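These group-level definitions translate directly into measurable quantities. The sketch below computes them for two groups from predictions and labels using only NumPy; the arrays and the binary protected attribute are placeholders.

```python
# Sketch: compare fairness metrics across two groups from model outputs.
# y_true, y_pred, and group are placeholder arrays of shape (n_samples,).
import numpy as np

def group_metrics(y_true, y_pred, mask):
    """Selection rate, TPR, FPR, and precision for one group."""
    yt, yp = y_true[mask], y_pred[mask]
    tp = np.sum((yp == 1) & (yt == 1))
    fp = np.sum((yp == 1) & (yt == 0))
    fn = np.sum((yp == 0) & (yt == 1))
    tn = np.sum((yp == 0) & (yt == 0))
    return {
        "selection_rate": yp.mean(),                  # demographic parity
        "tpr": tp / (tp + fn),                        # equalized odds (part 1)
        "fpr": fp / (fp + tn),                        # equalized odds (part 2)
        "precision": tp / (tp + fp),                  # predictive parity
    }

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 1000)
y_pred = rng.integers(0, 2, 1000)
group = rng.integers(0, 2, 1000)                      # 0/1 protected attribute

a = group_metrics(y_true, y_pred, group == 0)
b = group_metrics(y_true, y_pred, group == 1)
for metric in a:
    gap = abs(a[metric] - b[metric])
    print(f"{metric}: group 0={a[metric]:.3f}, group 1={b[metric]:.3f}, gap={gap:.3f}")
```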
The Impossibility Theorem
Here's a critical insight for CT-AI: These fairness definitions are mathematically incompatible (except in special cases). A model cannot simultaneously satisfy demographic parity, equalized odds, and predictive parity unless base rates are equal across groups.
This means fairness requires choices:
- Which groups matter?
- What fairness definition applies?
- How do we handle conflicts between definitions?
These are business and ethical decisions, not purely technical ones. Testers need to understand which fairness criteria the system should meet, then verify it meets them.
Why Fairness Matters
Legal requirements: Anti-discrimination laws prohibit unfair treatment based on protected characteristics in many domains including employment, lending, housing, and healthcare.
Ethical obligations: Organizations have moral duties to treat people fairly, regardless of legal requirements.
Business risk: Unfair AI systems create reputational damage, legal liability, and loss of customer trust.
Social impact: AI systems increasingly affect life outcomes. Unfair systems perpetuate and amplify existing inequalities.
Testing Fairness
Define protected groups and relevant metrics:
- Identify protected attributes (race, gender, age, disability, etc.)
- Select appropriate fairness metrics based on context and requirements
- Establish acceptable thresholds for fairness measurements
Measure performance across groups:
- Calculate accuracy metrics for each protected group
- Compare positive prediction rates
- Assess false positive and false negative rates by group
Test for proxy discrimination:
- Identify features correlated with protected attributes
- Evaluate whether the model uses proxies to achieve discriminatory outcomes
- Test with protected attributes removed to assess indirect effects
Evaluate training data:
- Assess representation across groups in training data
- Identify historical bias that might be encoded
- Evaluate labeling consistency across groups
Monitor production fairness:
- Track fairness metrics in production
- Alert on fairness degradation
- Audit sample decisions for fairness
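The proxy-discrimination step above can be partially automated by flagging features whose correlation with a protected attribute exceeds a review threshold. A minimal sketch with synthetic data, invented feature names, and an illustrative threshold:

```python
# Sketch: flag candidate proxy features by their correlation with a
# protected attribute. Data, feature names, and threshold are illustrative.
import numpy as np

rng = np.random.default_rng(1)
n = 5000
protected = rng.integers(0, 2, n)                     # binary protected attribute
features = {
    "income": rng.normal(50, 10, n) + 5 * protected,  # correlated proxy
    "zip_code_risk": rng.normal(0, 1, n) + 0.8 * protected,
    "years_employed": rng.normal(8, 3, n),            # independent feature
}

THRESHOLD = 0.2                                       # review anything above this
for name, values in features.items():
    corr = np.corrcoef(values, protected)[0, 1]
    flag = "REVIEW" if abs(corr) > THRESHOLD else "ok"
    print(f"{name:>15}: corr with protected attribute = {corr:+.2f} [{flag}]")
```

Correlation alone does not prove proxy discrimination, but it tells you which features warrant deeper analysis with protected attributes removed.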
Fairness Scenarios
Hiring AI: A system screening job applicants should evaluate candidates fairly regardless of gender, race, or age. Testing should verify equal callback rates for equally qualified candidates across groups.
Lending AI: A credit scoring system shouldn't discriminate based on protected attributes. However, it may legitimately consider factors correlated with creditworthiness that happen to differ across groups.
Medical AI: A diagnostic system should be equally accurate for all patient populations. Underrepresentation of certain groups in training data often causes worse performance for those groups.
Criminal justice AI: Risk assessment tools should achieve equal accuracy across racial groups. History shows this is particularly challenging given biased historical data.
Freedom from Bias
Bias in AI refers to systematic errors that unfairly favor certain outcomes. It's related to fairness but distinct - a system can be biased without discriminating against protected groups.
Types of Bias
Selection bias: Training data doesn't represent the population the model will serve.
- Training a medical AI on data primarily from research hospitals in wealthy areas
- Building a voice assistant using recordings from native English speakers only
- Developing facial recognition using datasets dominated by one demographic
Measurement bias: Systematic errors in how data is collected or labeled.
- Using proxy variables that don't accurately measure what you care about
- Inconsistent labeling standards applied by different annotators
- Sensors that work differently under different conditions
Aggregation bias: Assuming a single model fits all subpopulations when it doesn't.
- Building one model for all age groups when young and old patients present differently
- Creating a single recommendation algorithm for users with very different preferences
- Using averaged performance metrics that hide poor performance on subgroups
Evaluation bias: Test data doesn't represent real-world usage.
- Benchmark datasets that don't reflect production conditions
- Test scenarios that miss important edge cases
- Evaluation metrics that don't capture relevant quality dimensions
Deployment bias: The system is used differently than intended.
- A support tool being used as a decision-maker
- A general-purpose model applied to a specialized domain
- Predictions being used without appropriate human oversight
Testing for Bias
Data assessment:
- Evaluate training data representativeness
- Identify underrepresented groups or scenarios
- Assess label quality and consistency
- Test for historical bias encoded in data
Model evaluation:
- Test performance on subgroups
- Compare predictions against unbiased baselines
- Identify features that carry biased signals
- Evaluate model behavior at distribution edges
Production monitoring:
- Track performance metrics by segment
- Detect bias emergence over time
- Monitor for bias introduced by feedback loops
- Audit sample decisions for bias patterns
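The data-assessment step above often starts with a simple representativeness check: compare group shares in the training data against a reference population. A minimal sketch, where the group names, counts, and tolerance are illustrative placeholders:

```python
# Sketch: compare group shares in training data against a reference
# population. Group names, counts, and tolerance are illustrative.
reference_population = {"group_a": 0.49, "group_b": 0.41, "group_c": 0.10}
training_counts = {"group_a": 6300, "group_b": 3300, "group_c": 400}

total = sum(training_counts.values())
MAX_ABS_GAP = 0.05                                    # illustrative tolerance

for group, expected_share in reference_population.items():
    observed_share = training_counts[group] / total
    gap = observed_share - expected_share
    status = "UNDERREPRESENTED" if gap < -MAX_ABS_GAP else \
             "overrepresented" if gap > MAX_ABS_GAP else "ok"
    print(f"{group}: expected {expected_share:.0%}, observed {observed_share:.1%} ({status})")
```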
Exam Tip: Distinguish between different bias types. Questions may present scenarios and ask you to identify which type of bias is present or most likely.
Transparency
Transparency makes AI system operations visible and understandable to relevant stakeholders.
Components of Transparency
Data transparency: What data was used?
- Sources of training data
- Data collection methods
- Data preprocessing applied
- Known data limitations
Model transparency: How does the model work?
- Model architecture and type
- Feature engineering applied
- Training methodology
- Hyperparameter choices
Decision transparency: How are predictions made?
- Which features influenced specific decisions
- Confidence levels for predictions
- Limitations of predictions
Process transparency: How is the system managed?
- Development and validation procedures
- Testing conducted
- Monitoring in place
- Update and retraining processes
Outcome transparency: What results does the system produce?
- Aggregate performance metrics
- Known failure modes
- Documented limitations
- Impact assessments
Why Transparency Matters
Accountability: Transparent systems allow assignment of responsibility when things go wrong.
Trust calibration: Users appropriately trust systems when they understand their operation and limitations.
Audit capability: Regulators and auditors can evaluate systems when documentation is available.
Improvement: Development teams can improve systems when they understand current behavior.
Testing Transparency
Documentation completeness:
- Is data lineage documented?
- Are model choices explained?
- Are limitations clearly stated?
- Is testing methodology recorded?
Accessibility:
- Can relevant stakeholders access transparency information?
- Is technical content appropriately translated for different audiences?
- Are explanations available at the right time?
Accuracy:
- Does documentation match actual system behavior?
- Are stated limitations accurate?
- Do performance claims hold in practice?
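Checking that performance claims hold can be partly automated by comparing documented values against a fresh evaluation run. The sketch below assumes a hypothetical model-card dictionary and measured results; the metrics, values, and tolerance are invented for illustration.

```python
# Sketch: verify that documented performance claims hold in practice.
# The model-card structure, claimed values, and tolerance are assumptions.
documented_claims = {"accuracy": 0.92, "false_positive_rate": 0.04}
measured = {"accuracy": 0.89, "false_positive_rate": 0.07}   # from evaluation run
TOLERANCE = 0.02

for metric, claimed in documented_claims.items():
    actual = measured[metric]
    if abs(actual - claimed) > TOLERANCE:
        print(f"FAIL: documented {metric}={claimed} but measured {actual}")
    else:
        print(f"ok: {metric} within tolerance")
```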
Robustness
Robustness means the system maintains performance under varying conditions.
Robustness Dimensions
Input variation robustness: How does the system handle unusual but legitimate inputs?
- Noisy data (blurry images, unclear audio)
- Incomplete data (missing fields)
- Edge cases within the valid input space
Distribution shift robustness: How does performance change when data differs from training?
- Temporal drift (data patterns change over time)
- Demographic shift (different user populations)
- Environmental changes (new conditions)
Adversarial robustness: How does the system resist deliberate attacks?
- Adversarial examples designed to fool the model
- Data poisoning during training
- Model extraction attempts
Operational robustness: How does the system handle operational challenges?
- High load conditions
- Infrastructure failures
- Integration errors
Why Robustness Matters
Real-world deployment: Production conditions are messier than test conditions. Robust systems perform reliably in practice.
Safety: For safety-critical applications (medical, automotive), robustness failures can cause harm.
Security: Adversarial attacks can exploit non-robust systems for malicious purposes.
Maintenance: Robust systems require less frequent retraining and intervention.
Testing Robustness
Input perturbation testing:
- Add noise to inputs and measure performance degradation
- Test with corrupted or missing data
- Evaluate graceful degradation for out-of-spec inputs
Distribution testing:
- Test on data from different time periods
- Evaluate on demographically diverse data
- Test geographic or environmental variations
Adversarial testing:
- Generate adversarial examples using standard techniques
- Test sensitivity to small input changes
- Evaluate resistance to known attack patterns
Stress testing:
- Test under high load
- Evaluate failure modes under resource constraints
- Test recovery from errors
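The input-perturbation tests above can be scripted as a noise sweep: add increasing Gaussian noise to test inputs and track how far accuracy degrades. A minimal sketch with a synthetic dataset, model, and an illustrative degradation budget:

```python
# Sketch: measure accuracy degradation under increasing input noise.
# Dataset, model, and the acceptable degradation budget are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(3)
X = rng.normal(size=(2000, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
model = RandomForestClassifier(random_state=0).fit(X[:1500], y[:1500])
X_test, y_test = X[1500:], y[1500:]

baseline = accuracy_score(y_test, model.predict(X_test))
MAX_DROP = 0.10                                       # allowed accuracy loss

for noise_std in (0.05, 0.1, 0.2, 0.5):
    noisy = X_test + rng.normal(scale=noise_std, size=X_test.shape)
    acc = accuracy_score(y_test, model.predict(noisy))
    verdict = "ok" if baseline - acc <= MAX_DROP else "DEGRADED"
    print(f"noise std={noise_std}: accuracy {acc:.3f} (baseline {baseline:.3f}) [{verdict}]")
```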
Autonomy and Human Oversight
Autonomy concerns how independently the system operates and what human controls exist.
Autonomy Spectrum
AI systems span a range from fully assisted to fully autonomous:
Human-in-the-loop: AI makes recommendations, humans make decisions. Examples include diagnostic support tools and loan application screening.
Human-on-the-loop: AI makes decisions, humans monitor and can intervene. Examples include algorithmic trading with circuit breakers and autonomous vehicles with human oversight.
Human-out-of-the-loop: AI operates independently without human involvement. Examples include spam filters and recommendation algorithms.
Why Autonomy Matters for Testing
The appropriate level of autonomy depends on:
- Decision stakes (higher stakes require more oversight)
- AI reliability (more reliable systems can be more autonomous)
- Reversibility (irreversible decisions need more control)
- Regulatory requirements (some decisions require human involvement)
Testing must verify:
- Autonomy boundaries match requirements
- Human override mechanisms work
- Appropriate information supports human oversight
- Escalation paths function correctly
Testing Autonomy and Oversight
Boundary testing:
- Verify system operates within defined autonomy boundaries
- Test that boundary violations trigger appropriate responses
- Confirm human handoff works correctly
Override testing:
- Test that humans can override system decisions
- Verify overrides take effect appropriately
- Confirm override actions are logged
Information sufficiency:
- Test that oversight users receive necessary information
- Verify confidence indicators are provided
- Confirm uncertainty is communicated appropriately
Escalation testing:
- Test escalation paths for uncertain or high-stakes decisions
- Verify escalations reach appropriate reviewers
- Confirm escalation timing meets requirements
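To make the oversight checks concrete, here is a minimal sketch of an escalation test against a hypothetical routing function: predictions below a confidence threshold must be routed to human review, and every routing decision must be logged. The `route_decision` helper and the 0.8 threshold are assumptions for illustration, not a prescribed interface.

```python
# Sketch: test that low-confidence predictions are escalated to human review.
# route_decision and the 0.8 threshold are hypothetical, for illustration only.
CONFIDENCE_THRESHOLD = 0.8

def route_decision(prediction: int, confidence: float, audit_log: list) -> str:
    """Hypothetical routing logic: auto-apply only high-confidence decisions."""
    route = "auto" if confidence >= CONFIDENCE_THRESHOLD else "human_review"
    audit_log.append({"prediction": prediction, "confidence": confidence, "route": route})
    return route

def test_low_confidence_is_escalated():
    log = []
    assert route_decision(1, 0.95, log) == "auto"
    assert route_decision(1, 0.55, log) == "human_review"    # must escalate
    assert all("route" in entry for entry in log)            # escalation is logged

test_low_confidence_is_escalated()
print("escalation checks passed")
```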
Testing Quality Characteristics in Practice
Translating quality characteristics into practical tests requires structured approaches.
Quality Characteristic Test Planning
For each relevant quality characteristic:
- Define requirements: What level of the characteristic is required? What are acceptable thresholds?
- Identify metrics: How will you measure the characteristic? What calculations apply?
- Design tests: What test scenarios will reveal characteristic levels? What data is needed?
- Establish baselines: What performance is acceptable? What constitutes failure?
- Plan monitoring: How will you track characteristics in production?
Integration with Test Process
AI quality characteristics fit into standard test activities:
Test analysis: Identify which quality characteristics matter for the system under test. Determine relevant fairness definitions, required explainability levels, and robustness needs.
Test design: Create tests that evaluate characteristics. Design adversarial tests, fairness measurements, and explainability assessments.
Test implementation: Build test data sets covering relevant groups. Implement automated measurements where possible.
Test execution: Run characteristic tests alongside functional tests. Document results systematically.
Test reporting: Report characteristic measurements clearly. Highlight trade-offs and limitations.
Challenges in Practice
Subjectivity: Some characteristics (like explainability to users) are inherently subjective. User testing helps but is expensive.
Incomplete requirements: Stakeholders may not have specified which fairness definition to use or what robustness level is needed.
Technical complexity: Techniques like adversarial testing require specialized expertise.
Trade-off navigation: When characteristics conflict, testers need stakeholder guidance on priorities.
Managing Quality Trade-offs
Quality characteristics interact in complex ways, often requiring trade-offs.
Common Trade-off Patterns
Accuracy vs Explainability
- Complex models (such as deep neural networks) often achieve higher accuracy
- Simple models (decision trees, linear models) are more interpretable
- Post-hoc explanation techniques can partially bridge the gap
Fairness vs Accuracy
- Enforcing fairness constraints often reduces overall accuracy
- The accuracy loss may be acceptable given fairness benefits
- Different fairness definitions create different accuracy impacts
Robustness vs Performance
- Robustness techniques add computational overhead
- More robust models may be less specialized
- Adversarial training increases training cost
Autonomy vs Safety
- More autonomous systems can operate more efficiently and at greater scale
- Human oversight adds friction but catches errors
- Appropriate balance depends on stakes and reliability
Navigating Trade-offs
Testers should:
- Document trade-offs: Make explicit which characteristics are traded against each other
- Quantify impacts: Measure how improving one characteristic affects others
- Present options: Give stakeholders data to make informed decisions
- Verify implementation: Confirm the chosen trade-off balance is actually achieved
- Monitor over time: Track whether trade-offs remain appropriate as conditions change
Stakeholder Communication
Different stakeholders care about different characteristics:
- Business stakeholders: Often prioritize accuracy and performance
- Legal/compliance: Focus on fairness and regulatory requirements
- End users: Care about explainability and appropriate trust
- Security teams: Prioritize robustness against attacks
Testers bridge these perspectives by providing comprehensive quality information that supports balanced decision-making.
Frequently Asked Questions
- How do I test for explainability?
- What's the difference between fairness and freedom from bias?
- Which fairness metric should I use for testing?
- How do I test for adversarial robustness?
- What level of autonomy is appropriate for AI systems?
- How do quality characteristic trade-offs affect testing?
- What types of bias should I test for?
- How do regulations affect testing for quality characteristics?