
ISTQB CT-AI: Quality Characteristics for AI-Based Systems
AI systems require evaluation against quality characteristics that traditional software testing barely addresses. While functional correctness remains important, AI systems introduce concerns around fairness, explainability, and robustness that demand specific testing approaches.
This article provides a deep dive into quality characteristics for AI-based systems, expanding on the overview from CT-AI Chapter 2 with practical testing guidance. You'll learn what each characteristic means, why it matters, how to test for it, and the trade-offs involved.
Why AI Needs Different Quality Characteristics
Traditional software quality focuses on whether the system does what it's supposed to do. You specify behavior, implement it, and test that implementation matches specification. The relationship between input and output is deterministic and predictable.
AI systems work differently. Their behavior emerges from training data rather than explicit programming. This creates quality concerns that traditional testing doesn't address:
Behavior isn't fully specified: You can't write requirements that capture everything an ML model might do. The training data implicitly defines behavior, including behaviors you didn't anticipate.
Decisions affect people: AI systems increasingly make decisions about people - who gets loans, jobs, medical treatments, or bail. These decisions require evaluation against ethical standards, not just functional requirements.
Users can't verify correctness: When an AI recommends a diagnosis or predicts credit risk, users can't easily verify if it's correct. They need other ways to calibrate trust.
Errors have patterns: AI systems don't fail randomly. They fail systematically based on their training data and architecture. Understanding failure patterns requires different testing approaches.
These differences make AI-specific quality characteristics essential for responsible AI deployment.
Explainability and Interpretability
Explainability addresses a fundamental question: Why did the system make this decision?
What Explainability Means
Explainability is the ability to provide human-understandable explanations for AI decisions. An explainable system can tell you why it made a specific prediction or recommendation.
Interpretability is closely related but refers to the ability to understand how a model works internally. An interpretable model's decision-making process is transparent by design.
The distinction matters:
- A decision tree is interpretable because you can follow the decision path
- A deep neural network may be explainable through techniques like SHAP or LIME, but it's not inherently interpretable
Levels of Explainability
Global explainability describes overall model behavior:
- Which features are most important across all predictions?
- What patterns does the model rely on?
- How does the model generally behave?
Global explanations help developers and auditors understand the model as a whole.
Local explainability explains individual predictions:
- Why was this specific loan application denied?
- What factors contributed to this fraud score?
- Why did the system recommend this treatment?
Local explanations help users and affected individuals understand specific decisions.
Why Explainability Matters
Regulatory compliance: GDPR gives individuals rights around automated decision-making that are widely interpreted as a right to explanation. The EU AI Act requires transparency for high-risk systems. Many industry regulations mandate explainability.
User trust: Users appropriately calibrate trust when they understand system reasoning. Unexplained AI recommendations may be blindly accepted (over-reliance) or ignored (under-reliance).
Error detection: Explanations help identify when models rely on spurious correlations. A model might achieve high accuracy by learning shortcuts that don't generalize.
Debugging: When models fail, explanations help diagnose why. Understanding which features drove a wrong prediction guides improvement efforts.
Testing Explainability
Testing explainability involves multiple dimensions:
Existence: Are explanations provided when required?
- Test that explanation features are available
- Verify explanations are generated for all relevant decisions
- Confirm explanations are accessible to intended users
Understandability: Can target audiences understand explanations?
- Test with representative users, not just developers
- Assess whether technical jargon is avoided
- Verify explanations match user mental models
Accuracy: Do explanations reflect actual model reasoning?
- Compare explanations to known feature importance
- Test that changing highlighted factors changes predictions
- Verify explanations aren't misleading
Consistency: Are similar predictions explained similarly?
- Test that comparable inputs receive comparable explanations
- Verify explanations are stable across model versions
- Assess explanation consistency across demographic groups
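The accuracy dimension above can be probed programmatically: if an explanation claims a feature is important, perturbing that feature should change predictions more than perturbing an unimportant one. Below is a minimal, model-agnostic sketch using scikit-learn's permutation importance as the explanation source; the dataset and model are synthetic placeholders, not a prescribed method.

```python
# Sketch: verify that "important" features actually drive predictions
# (the explanation-accuracy check). Synthetic data; model is illustrative.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 4))
y = (X[:, 0] + 0.3 * X[:, 1] > 0).astype(int)   # feature 0 dominates by design
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Rank features by permutation importance (a global explanation).
imp = permutation_importance(model, X, y, n_repeats=10, random_state=0)
ranking = np.argsort(imp.importances_mean)[::-1]
top, bottom = ranking[0], ranking[-1]

def prediction_shift(feature: int) -> float:
    """Mean change in predicted probability when one feature is shuffled."""
    X_perturbed = X.copy()
    X_perturbed[:, feature] = rng.permutation(X_perturbed[:, feature])
    return float(np.mean(np.abs(
        model.predict_proba(X)[:, 1] - model.predict_proba(X_perturbed)[:, 1]
    )))

# The feature the explanation highlights should matter more than one it doesn't.
assert prediction_shift(top) > prediction_shift(bottom), \
    "Explanation ranking does not match observed model behaviour"
```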
Exam Tip: Remember the distinction between global and local explainability. Questions often present scenarios requiring you to identify which level is relevant.
Explainability Techniques
While you don't need to implement these techniques, understanding them helps you evaluate explainability:
LIME (Local Interpretable Model-agnostic Explanations): Fits a simple surrogate model around an individual prediction by sampling inputs near the instance being explained, approximating the model's local behavior.
SHAP (SHapley Additive exPlanations): Uses game theory to assign importance values to each feature for a prediction. Provides both global and local explanations.
Attention visualization: For models using attention mechanisms, visualizing attention weights shows what the model focused on.
Counterfactual explanations: Describe what would need to change for a different outcome. "The loan would have been approved if income were $5,000 higher."
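As an illustration of the counterfactual idea, the sketch below searches for the smallest income increase that flips a rejected application to approved. The loan model, features, and step size are illustrative assumptions, not a standard algorithm.

```python
# Sketch: generate a counterfactual explanation for a hypothetical loan model
# by searching for the smallest income increase that flips the decision.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
# Synthetic applicants: [income_in_thousands, existing_debt_in_thousands]
X = rng.uniform([20, 0], [150, 80], size=(2000, 2))
y = (X[:, 0] - 0.8 * X[:, 1] > 40).astype(int)        # toy approval rule
model = LogisticRegression().fit(X, y)

applicant = np.array([[45.0, 30.0]])                  # currently rejected
assert model.predict(applicant)[0] == 0

step = 1.0                                            # search in $1k increments
counterfactual = applicant.copy()
while model.predict(counterfactual)[0] == 0 and counterfactual[0, 0] < 300:
    counterfactual[0, 0] += step

increase = counterfactual[0, 0] - applicant[0, 0]
print(f"Decision flips if income were ${increase * 1000:,.0f} higher")
```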
Fairness in AI Systems
Fairness ensures AI systems treat different groups equitably, avoiding unjust discrimination.
Defining Fairness
Fairness seems straightforward until you try to define it precisely. Multiple definitions exist, and they often conflict:
Individual fairness: Similar individuals should receive similar treatment. If two people differ only in protected attributes (race, gender, age), they should receive the same prediction.
Group fairness (demographic parity): Different groups should receive positive outcomes at equal rates. If 20% of men receive loans, 20% of women should too.
Equalized odds: True positive rates and false positive rates should be equal across groups. The model should be equally accurate (and equally wrong) for everyone.
Predictive parity: Precision should be equal across groups. When the model predicts positive, it should be equally likely to be correct regardless of group.
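These group-level definitions translate directly into measurable quantities. The sketch below computes them for two groups from predictions and labels using only NumPy; the arrays and the binary protected attribute are placeholders.

```python
# Sketch: compare fairness metrics across two groups from model outputs.
# y_true, y_pred, and group are placeholder arrays of shape (n_samples,).
import numpy as np

def group_metrics(y_true, y_pred, mask):
    """Selection rate, TPR, FPR, and precision for one group."""
    yt, yp = y_true[mask], y_pred[mask]
    tp = np.sum((yp == 1) & (yt == 1))
    fp = np.sum((yp == 1) & (yt == 0))
    fn = np.sum((yp == 0) & (yt == 1))
    tn = np.sum((yp == 0) & (yt == 0))
    return {
        "selection_rate": yp.mean(),                  # demographic parity
        "tpr": tp / (tp + fn),                        # equalized odds (part 1)
        "fpr": fp / (fp + tn),                        # equalized odds (part 2)
        "precision": tp / (tp + fp),                  # predictive parity
    }

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 1000)
y_pred = rng.integers(0, 2, 1000)
group = rng.integers(0, 2, 1000)                      # 0/1 protected attribute

a = group_metrics(y_true, y_pred, group == 0)
b = group_metrics(y_true, y_pred, group == 1)
for metric in a:
    gap = abs(a[metric] - b[metric])
    print(f"{metric}: group 0={a[metric]:.3f}, group 1={b[metric]:.3f}, gap={gap:.3f}")
```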
The Impossibility Theorem
Here's a critical insight for CT-AI: These fairness definitions are mathematically incompatible (except in special cases). A model cannot simultaneously satisfy demographic parity, equalized odds, and predictive parity unless base rates are equal across groups.
This means fairness requires choices:
- Which groups matter?
- What fairness definition applies?
- How do we handle conflicts between definitions?
These are business and ethical decisions, not purely technical ones. Testers need to understand which fairness criteria the system should meet, then verify it meets them.
Why Fairness Matters
Legal requirements: Anti-discrimination laws prohibit unfair treatment based on protected characteristics in many domains including employment, lending, housing, and healthcare.
Ethical obligations: Organizations have moral duties to treat people fairly, regardless of legal requirements.
Business risk: Unfair AI systems create reputational damage, legal liability, and loss of customer trust.
Social impact: AI systems increasingly affect life outcomes. Unfair systems perpetuate and amplify existing inequalities.
Testing Fairness
Define protected groups and relevant metrics:
- Identify protected attributes (race, gender, age, disability, etc.)
- Select appropriate fairness metrics based on context and requirements
- Establish acceptable thresholds for fairness measurements
Measure performance across groups:
- Calculate accuracy metrics for each protected group
- Compare positive prediction rates
- Assess false positive and false negative rates by group
Test for proxy discrimination:
- Identify features correlated with protected attributes
- Evaluate whether the model uses proxies to achieve discriminatory outcomes
- Test with protected attributes removed to assess indirect effects
Evaluate training data:
- Assess representation across groups in training data
- Identify historical bias that might be encoded
- Evaluate labeling consistency across groups
Monitor production fairness:
- Track fairness metrics in production
- Alert on fairness degradation
- Audit sample decisions for fairness
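The proxy-discrimination step above can be partially automated by flagging features whose correlation with a protected attribute exceeds a review threshold. A minimal sketch with synthetic data, invented feature names, and an illustrative threshold:

```python
# Sketch: flag candidate proxy features by their correlation with a
# protected attribute. Data, feature names, and threshold are illustrative.
import numpy as np

rng = np.random.default_rng(1)
n = 5000
protected = rng.integers(0, 2, n)                     # binary protected attribute
features = {
    "income": rng.normal(50, 10, n) + 5 * protected,  # correlated proxy
    "zip_code_risk": rng.normal(0, 1, n) + 0.8 * protected,
    "years_employed": rng.normal(8, 3, n),            # independent feature
}

THRESHOLD = 0.2                                       # review anything above this
for name, values in features.items():
    corr = np.corrcoef(values, protected)[0, 1]
    flag = "REVIEW" if abs(corr) > THRESHOLD else "ok"
    print(f"{name:>15}: corr with protected attribute = {corr:+.2f} [{flag}]")
```

Correlation alone does not prove proxy discrimination, but it tells you which features warrant deeper analysis with protected attributes removed.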
Fairness Scenarios
Hiring AI: A system screening job applicants should evaluate candidates fairly regardless of gender, race, or age. Testing should verify equal callback rates for equally qualified candidates across groups.
Lending AI: A credit scoring system shouldn't discriminate based on protected attributes. However, it may legitimately consider factors correlated with creditworthiness that happen to differ across groups.
Medical AI: A diagnostic system should be equally accurate for all patient populations. Underrepresentation of certain groups in training data often causes worse performance for those groups.
Criminal justice AI: Risk assessment tools should achieve equal accuracy across racial groups. History shows this is particularly challenging given biased historical data.
Freedom from Bias
Bias in AI refers to systematic errors that unfairly favor certain outcomes. It's related to fairness but distinct - a system can be biased without discriminating against protected groups.
Types of Bias
Selection bias: Training data doesn't represent the population the model will serve.
- Training a medical AI on data primarily from research hospitals in wealthy areas
- Building a voice assistant using recordings from native English speakers only
- Developing facial recognition using datasets dominated by one demographic
Measurement bias: Systematic errors in how data is collected or labeled.
- Using proxy variables that don't accurately measure what you care about
- Inconsistent labeling standards applied by different annotators
- Sensors that work differently under different conditions
Aggregation bias: Assuming a single model fits all subpopulations when it doesn't.
- Building one model for all age groups when young and old patients present differently
- Creating a single recommendation algorithm for users with very different preferences
- Using averaged performance metrics that hide poor performance on subgroups
Evaluation bias: Test data doesn't represent real-world usage.
- Benchmark datasets that don't reflect production conditions
- Test scenarios that miss important edge cases
- Evaluation metrics that don't capture relevant quality dimensions
Deployment bias: The system is used differently than intended.
- A support tool being used as a decision-maker
- A general-purpose model applied to a specialized domain
- Predictions being used without appropriate human oversight
Testing for Bias
Data assessment:
- Evaluate training data representativeness
- Identify underrepresented groups or scenarios
- Assess label quality and consistency
- Test for historical bias encoded in data
Model evaluation:
- Test performance on subgroups
- Compare predictions against unbiased baselines
- Identify features that carry biased signals
- Evaluate model behavior at distribution edges
Production monitoring:
- Track performance metrics by segment
- Detect bias emergence over time
- Monitor for bias introduced by feedback loops
- Audit sample decisions for bias patterns
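The data-assessment step above often starts with a simple representativeness check: compare group shares in the training data against a reference population. A minimal sketch, where the group names, counts, and tolerance are illustrative placeholders:

```python
# Sketch: compare group shares in training data against a reference
# population. Group names, counts, and tolerance are illustrative.
reference_population = {"group_a": 0.49, "group_b": 0.41, "group_c": 0.10}
training_counts = {"group_a": 6300, "group_b": 3300, "group_c": 400}

total = sum(training_counts.values())
MAX_ABS_GAP = 0.05                                    # illustrative tolerance

for group, expected_share in reference_population.items():
    observed_share = training_counts[group] / total
    gap = observed_share - expected_share
    status = "UNDERREPRESENTED" if gap < -MAX_ABS_GAP else \
             "overrepresented" if gap > MAX_ABS_GAP else "ok"
    print(f"{group}: expected {expected_share:.0%}, observed {observed_share:.1%} ({status})")
```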
Exam Tip: Distinguish between different bias types. Questions may present scenarios and ask you to identify which type of bias is present or most likely.
Transparency
Transparency makes AI system operations visible and understandable to relevant stakeholders.
Components of Transparency
Data transparency: What data was used?
- Sources of training data
- Data collection methods
- Data preprocessing applied
- Known data limitations
Model transparency: How does the model work?
- Model architecture and type
- Feature engineering applied
- Training methodology
- Hyperparameter choices
Decision transparency: How are predictions made?
- Which features influenced specific decisions
- Confidence levels for predictions
- Limitations of predictions
Process transparency: How is the system managed?
- Development and validation procedures
- Testing conducted
- Monitoring in place
- Update and retraining processes
Outcome transparency: What results does the system produce?
- Aggregate performance metrics
- Known failure modes
- Documented limitations
- Impact assessments
Why Transparency Matters
Accountability: Transparent systems allow assignment of responsibility when things go wrong.
Trust calibration: Users appropriately trust systems when they understand their operation and limitations.
Audit capability: Regulators and auditors can evaluate systems when documentation is available.
Improvement: Development teams can improve systems when they understand current behavior.
Testing Transparency
Documentation completeness:
- Is data lineage documented?
- Are model choices explained?
- Are limitations clearly stated?
- Is testing methodology recorded?
Accessibility:
- Can relevant stakeholders access transparency information?
- Is technical content appropriately translated for different audiences?
- Are explanations available at the right time?
Accuracy:
- Does documentation match actual system behavior?
- Are stated limitations accurate?
- Do performance claims hold in practice?
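Checking that performance claims hold can be partly automated by comparing documented values against a fresh evaluation run. The sketch below assumes a hypothetical model-card dictionary and measured results; the metrics, values, and tolerance are invented for illustration.

```python
# Sketch: verify that documented performance claims hold in practice.
# The model-card structure, claimed values, and tolerance are assumptions.
documented_claims = {"accuracy": 0.92, "false_positive_rate": 0.04}
measured = {"accuracy": 0.89, "false_positive_rate": 0.07}   # from evaluation run
TOLERANCE = 0.02

for metric, claimed in documented_claims.items():
    actual = measured[metric]
    if abs(actual - claimed) > TOLERANCE:
        print(f"FAIL: documented {metric}={claimed} but measured {actual}")
    else:
        print(f"ok: {metric} within tolerance")
```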
Robustness
Robustness means the system maintains performance under varying conditions.
Robustness Dimensions
Input variation robustness: How does the system handle unusual but legitimate inputs?
- Noisy data (blurry images, unclear audio)
- Incomplete data (missing fields)
- Edge cases within the valid input space
Distribution shift robustness: How does performance change when data differs from training?
- Temporal drift (data patterns change over time)
- Demographic shift (different user populations)
- Environmental changes (new conditions)
Adversarial robustness: How does the system resist deliberate attacks?
- Adversarial examples designed to fool the model
- Data poisoning during training
- Model extraction attempts
Operational robustness: How does the system handle operational challenges?
- High load conditions
- Infrastructure failures
- Integration errors
Why Robustness Matters
Real-world deployment: Production conditions are messier than test conditions. Robust systems perform reliably in practice.
Safety: For safety-critical applications (medical, automotive), robustness failures can cause harm.
Security: Adversarial attacks can exploit non-robust systems for malicious purposes.
Maintenance: Robust systems require less frequent retraining and intervention.
Testing Robustness
Input perturbation testing:
- Add noise to inputs and measure performance degradation
- Test with corrupted or missing data
- Evaluate graceful degradation for out-of-spec inputs
Distribution testing:
- Test on data from different time periods
- Evaluate on demographically diverse data
- Test geographic or environmental variations
Adversarial testing:
- Generate adversarial examples using standard techniques
- Test sensitivity to small input changes
- Evaluate resistance to known attack patterns
Stress testing:
- Test under high load
- Evaluate failure modes under resource constraints
- Test recovery from errors
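The input-perturbation tests above can be scripted as a noise sweep: add increasing Gaussian noise to test inputs and track how far accuracy degrades. A minimal sketch with a synthetic dataset, model, and an illustrative degradation budget:

```python
# Sketch: measure accuracy degradation under increasing input noise.
# Dataset, model, and the acceptable degradation budget are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(3)
X = rng.normal(size=(2000, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
model = RandomForestClassifier(random_state=0).fit(X[:1500], y[:1500])
X_test, y_test = X[1500:], y[1500:]

baseline = accuracy_score(y_test, model.predict(X_test))
MAX_DROP = 0.10                                       # allowed accuracy loss

for noise_std in (0.05, 0.1, 0.2, 0.5):
    noisy = X_test + rng.normal(scale=noise_std, size=X_test.shape)
    acc = accuracy_score(y_test, model.predict(noisy))
    verdict = "ok" if baseline - acc <= MAX_DROP else "DEGRADED"
    print(f"noise std={noise_std}: accuracy {acc:.3f} (baseline {baseline:.3f}) [{verdict}]")
```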
Autonomy and Human Oversight
Autonomy concerns how independently the system operates and what human controls exist.
Autonomy Spectrum
AI systems span a range from fully assisted to fully autonomous:
Human-in-the-loop: AI makes recommendations, humans make decisions. Examples include diagnostic support tools and loan application screening.
Human-on-the-loop: AI makes decisions, humans monitor and can intervene. Examples include algorithmic trading with circuit breakers and autonomous vehicles with human oversight.
Human-out-of-the-loop: AI operates independently without human involvement. Examples include spam filters and recommendation algorithms.
Why Autonomy Matters for Testing
The appropriate level of autonomy depends on:
- Decision stakes (higher stakes require more oversight)
- AI reliability (more reliable systems can be more autonomous)
- Reversibility (irreversible decisions need more control)
- Regulatory requirements (some decisions require human involvement)
Testing must verify:
- Autonomy boundaries match requirements
- Human override mechanisms work
- Appropriate information supports human oversight
- Escalation paths function correctly
Testing Autonomy and Oversight
Boundary testing:
- Verify system operates within defined autonomy boundaries
- Test that boundary violations trigger appropriate responses
- Confirm human handoff works correctly
Override testing:
- Test that humans can override system decisions
- Verify overrides take effect appropriately
- Confirm override actions are logged
Information sufficiency:
- Test that oversight users receive necessary information
- Verify confidence indicators are provided
- Confirm uncertainty is communicated appropriately
Escalation testing:
- Test escalation paths for uncertain or high-stakes decisions
- Verify escalations reach appropriate reviewers
- Confirm escalation timing meets requirements
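To make the oversight checks concrete, here is a minimal sketch of an escalation test against a hypothetical routing function: predictions below a confidence threshold must be routed to human review, and every routing decision must be logged. The `route_decision` helper and the 0.8 threshold are assumptions for illustration, not a prescribed interface.

```python
# Sketch: test that low-confidence predictions are escalated to human review.
# route_decision and the 0.8 threshold are hypothetical, for illustration only.
CONFIDENCE_THRESHOLD = 0.8

def route_decision(prediction: int, confidence: float, audit_log: list) -> str:
    """Hypothetical routing logic: auto-apply only high-confidence decisions."""
    route = "auto" if confidence >= CONFIDENCE_THRESHOLD else "human_review"
    audit_log.append({"prediction": prediction, "confidence": confidence, "route": route})
    return route

def test_low_confidence_is_escalated():
    log = []
    assert route_decision(1, 0.95, log) == "auto"
    assert route_decision(1, 0.55, log) == "human_review"    # must escalate
    assert all("route" in entry for entry in log)            # escalation is logged

test_low_confidence_is_escalated()
print("escalation checks passed")
```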
Testing Quality Characteristics in Practice
Translating quality characteristics into practical tests requires structured approaches.
Quality Characteristic Test Planning
For each relevant quality characteristic:
- Define requirements: What level of the characteristic is required? What are acceptable thresholds?
- Identify metrics: How will you measure the characteristic? What calculations apply?
- Design tests: What test scenarios will reveal characteristic levels? What data is needed?
- Establish baselines: What performance is acceptable? What constitutes failure?
- Plan monitoring: How will you track characteristics in production?
Integration with Test Process
AI quality characteristics fit into standard test activities:
Test analysis: Identify which quality characteristics matter for the system under test. Determine relevant fairness definitions, required explainability levels, and robustness needs.
Test design: Create tests that evaluate characteristics. Design adversarial tests, fairness measurements, and explainability assessments.
Test implementation: Build test data sets covering relevant groups. Implement automated measurements where possible.
Test execution: Run characteristic tests alongside functional tests. Document results systematically.
Test reporting: Report characteristic measurements clearly. Highlight trade-offs and limitations.
Challenges in Practice
Subjectivity: Some characteristics (like explainability to users) are inherently subjective. User testing helps but is expensive.
Incomplete requirements: Stakeholders may not have specified which fairness definition to use or what robustness level is needed.
Technical complexity: Techniques like adversarial testing require specialized expertise.
Trade-off navigation: When characteristics conflict, testers need stakeholder guidance on priorities.
Managing Quality Trade-offs
Quality characteristics interact in complex ways, often requiring trade-offs.
Common Trade-off Patterns
Accuracy vs Explainability
- Complex models (such as deep neural networks) often achieve higher accuracy
- Simple models (decision trees, linear models) are more interpretable
- Post-hoc explanation techniques can partially bridge the gap
Fairness vs Accuracy
- Enforcing fairness constraints often reduces overall accuracy
- The accuracy loss may be acceptable given fairness benefits
- Different fairness definitions create different accuracy impacts
Robustness vs Performance
- Robustness techniques add computational overhead
- More robust models may be less specialized
- Adversarial training increases training cost
Autonomy vs Safety
- More autonomous systems can operate more efficiently and at greater scale
- Human oversight adds friction but catches errors
- Appropriate balance depends on stakes and reliability
Navigating Trade-offs
Testers should:
- Document trade-offs: Make explicit which characteristics are traded against each other
- Quantify impacts: Measure how improving one characteristic affects others
- Present options: Give stakeholders data to make informed decisions
- Verify implementation: Confirm the chosen trade-off balance is actually achieved
- Monitor over time: Track whether trade-offs remain appropriate as conditions change
Stakeholder Communication
Different stakeholders care about different characteristics:
- Business stakeholders: Often prioritize accuracy and performance
- Legal/compliance: Focus on fairness and regulatory requirements
- End users: Care about explainability and appropriate trust
- Security teams: Prioritize robustness against attacks
Testers bridge these perspectives by providing comprehensive quality information that supports balanced decision-making.
Frequently Asked Questions
- How do I test for explainability?
- What's the difference between fairness and freedom from bias?
- Which fairness metric should I use for testing?
- How do I test for adversarial robustness?
- What level of autonomy is appropriate for AI systems?
- How do quality characteristic trade-offs affect testing?
- What types of bias should I test for?
- How do regulations affect testing for quality characteristics?