ISTQB CT-GenAI Fundamentals: Understanding Generative AI for Software Testers

Parul Dhingra - Senior Quality Analyst

Updated: 1/25/2026

Before you can effectively use generative AI for testing or prepare for the CT-GenAI certification exam, you need to understand what these systems actually are and how they work. This isn't about becoming an AI engineer. It's about developing the conceptual foundation that makes everything else in the syllabus make sense.

Many testers start using ChatGPT, Claude, or GitHub Copilot without understanding why these tools sometimes produce brilliant results and other times generate complete nonsense. The fundamentals chapter of CT-GenAI addresses this gap, ensuring you understand the technology well enough to use it effectively and recognize its limitations.

This article covers Chapter 1 of the CT-GenAI syllabus in depth. You'll learn what generative AI actually means, how Large Language Models work conceptually, what capabilities and constraints define these systems, and why their probabilistic nature matters so much for testing applications.

What is Generative AI?

Generative AI refers to artificial intelligence systems capable of creating new content rather than simply analyzing or classifying existing content. When you ask ChatGPT to write test cases, it doesn't retrieve pre-written test cases from a database. It generates new text based on patterns learned during training.

This distinction is crucial. Generative AI systems produce novel outputs that never existed before. Each response is created on demand based on your input and the model's learned patterns. The content might resemble things in the training data, but it's not copied from anywhere.

The "Generative" in Generative AI

The term "generative" comes from the model's ability to generate new instances that share characteristics with its training data. Think of it like this: if you showed someone thousands of examples of professional test case documentation, they would learn the patterns, structures, and conventions. Later, they could create new test cases that follow those same patterns even for systems they've never seen.

Generative AI works similarly but at a much larger scale. Models are trained on billions of text examples, learning patterns across languages, formats, styles, and domains. When given a prompt, they generate responses by applying these learned patterns to produce contextually appropriate content.

Types of Generative AI

While CT-GenAI focuses primarily on text-based AI (Large Language Models), generative AI encompasses several categories:

Text generation (LLMs): Models like GPT-4, Claude, and Gemini that generate written content. These are the primary focus for testing applications.

Code generation: Specialized models or capabilities that produce programming code. GitHub Copilot and Amazon CodeWhisperer are prominent examples.

Image generation: Models like DALL-E, Midjourney, and Stable Diffusion that create images from text descriptions. Less directly relevant to testing but part of the generative AI landscape.

Multi-modal models: Systems that can process and generate multiple types of content, such as models that understand both images and text. GPT-4 with vision capabilities is an example.

For the CT-GenAI exam, focus on text and code generation capabilities since these have the most direct testing applications.

Traditional AI vs Generative AI

Understanding how generative AI differs from traditional AI helps clarify why it's such a significant shift for testing.

Traditional AI: Classification and Prediction

Traditional AI systems excel at specific, well-defined tasks:

Classification: Determining which category something belongs to. Email spam filters classify messages as spam or not-spam. Image recognition systems classify images by content.

Prediction: Forecasting outcomes based on patterns. Fraud detection systems predict whether transactions are fraudulent. Recommendation systems predict what users might want.

Optimization: Finding optimal solutions within defined constraints. Route optimization finds the best delivery paths. Resource allocation balances competing demands.

These systems are "narrow AI" designed for specific tasks. A spam filter can't write emails; it can only classify them. A recommendation system can't explain why you might enjoy a product; it can only predict your likelihood of purchasing it.

Generative AI: Creation and Conversation

Generative AI represents a different paradigm:

Open-ended interaction: You can ask generative AI almost anything, and it will attempt a response. There's no fixed set of supported queries.

Novel content creation: Outputs are generated fresh rather than retrieved from storage. The same prompt may produce different responses.

Conversational capability: Models maintain context across multiple exchanges, enabling back-and-forth dialogue.

Task flexibility: A single model can help with writing, coding, analysis, explanation, translation, and many other tasks without being explicitly programmed for each.

Why This Matters for Testing

For testers, generative AI's flexibility is both its power and its danger:

Power: You can ask AI to help with almost any testing task. Generate test cases, write automation scripts, analyze defect patterns, create documentation, explain technical concepts, or brainstorm testing approaches.

Danger: Because AI attempts to respond to anything, it will generate plausible-sounding responses even when it doesn't "know" the correct answer. Unlike traditional AI that might simply decline or return "unknown," generative AI confidently produces content that may be completely wrong.

Exam Tip: Questions about traditional vs. generative AI often test whether you understand that generative AI creates new content rather than retrieving stored information, and that this creative capability comes with risks like hallucination.

Large Language Models Explained

Large Language Models (LLMs) are the foundation of modern text-based generative AI. Understanding what makes them "large" and why they're called "language models" helps you grasp their capabilities and limitations.

What Makes a Model "Large"?

LLMs are large in two dimensions:

Parameter count: LLMs contain billions of learnable parameters, which are the numerical values the model adjusts during training to improve performance. GPT-4 is estimated to have over a trillion parameters across its model architecture. These parameters encode patterns learned from training data.

Training data size: LLMs are trained on massive text datasets containing hundreds of billions to trillions of words. This includes books, websites, academic papers, code repositories, and many other text sources. The scale of training data enables the model to learn patterns across diverse domains and contexts.

The combination of many parameters and extensive training data allows LLMs to capture subtle linguistic patterns and factual associations that smaller models miss.

Why "Language Model"?

The term "language model" refers to the core function: predicting probable text. At its heart, an LLM is trained to predict what text should come next given preceding text.

During training, the model sees billions of text sequences and learns to predict missing or subsequent words. Through this process, it develops understanding of:

  • Grammar and syntax
  • Semantic relationships between concepts
  • Common patterns in different text types
  • Factual associations (though these are pattern-based, not knowledge-based)
  • Stylistic conventions for different contexts

Architectures: Transformers

Modern LLMs use a neural network architecture called the Transformer. You don't need deep technical understanding for CT-GenAI, but knowing a few key concepts helps:

Attention mechanism: Transformers use "attention" to weigh how much different parts of the input relate to each other. When generating a response about software testing, the model attends more strongly to testing-related context than unrelated content.

Parallel processing: Transformers process input tokens in parallel rather than sequentially, enabling efficient training on large datasets and faster response generation.

Context handling: The architecture enables handling long input sequences while maintaining coherence, though context length has practical limits.
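
For readers who want slightly more than the conceptual picture, here is a minimal NumPy sketch of scaled dot-product attention, the core operation described above. Real Transformers add learned projections, multiple attention heads, and masking, all omitted here, and nothing this detailed is required for the exam:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Weigh each value by how strongly its key matches the query, then combine.

    Q, K, V are (sequence_length, dimension) arrays; the output mixes the rows
    of V according to the attention weights.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # query-key similarity
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V
```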

Pre-training and Fine-tuning

LLMs typically undergo two training phases:

Pre-training: The model learns general language patterns from massive text datasets. This phase gives the model its broad capabilities but doesn't optimize for any specific task.

Fine-tuning: The pre-trained model is further trained on specific datasets to improve performance for particular use cases. For chat-based assistants, this often includes instruction-following training and alignment training to make responses helpful and safe.

Models may also undergo Reinforcement Learning from Human Feedback (RLHF), where human evaluators rate responses and the model learns from these ratings.

How LLMs Generate Responses

Understanding the generation process explains many LLM behaviors that affect testing applications.

Token-by-Token Generation

LLMs generate text one "token" at a time. A token is typically a word or word fragment. For each position in the response, the model:

  1. Considers the prompt and any response generated so far
  2. Calculates probability scores for every possible next token
  3. Selects the next token based on these probabilities
  4. Adds the selected token to the response
  5. Repeats until the response is complete

This sequential generation means the model doesn't "plan" its entire response in advance. It commits to each word as it goes, which can lead to inconsistencies or logical issues in longer responses.
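
As a rough sketch of those five steps, the loop below uses greedy decoding (always taking the most probable token). The `model` object and its `next_token_probabilities` method are hypothetical stand-ins for a real LLM, which works over tens of thousands of tokens and usually samples rather than always taking the top choice:

```python
def generate(model, prompt_tokens, max_new_tokens=50, end_token="<eos>"):
    """Illustrative token-by-token generation loop (greedy decoding).

    `model.next_token_probabilities` is a hypothetical stand-in that returns a
    dict mapping every candidate token to its probability given the context.
    """
    response = []
    for _ in range(max_new_tokens):
        context = prompt_tokens + response                # 1. prompt + text so far
        probs = model.next_token_probabilities(context)   # 2. score every token
        next_token = max(probs, key=probs.get)            # 3. pick the most probable
        if next_token == end_token:                       # model signals completion
            break
        response.append(next_token)                       # 4. append, 5. repeat
    return response
```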

Probability Distribution

For each token position, the model produces a probability distribution across its entire vocabulary, which typically contains 50,000 to 100,000 tokens. The model estimates which tokens are most likely to follow given the context.

For example, if the prompt is "Write a test case for the login..." the model assigns high probabilities to tokens like "function," "feature," "page," or "form" because these commonly follow such context. Unrelated tokens like "elephant" receive near-zero probability.
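
As an illustration of what such a distribution might look like for that prompt (the numbers below are invented, and a real vocabulary spreads probability over tens of thousands of tokens):

```python
# Hypothetical next-token probabilities after the prompt
# "Write a test case for the login..."
next_token_probs = {
    "page": 0.34,
    "form": 0.27,
    "feature": 0.19,
    "function": 0.12,
    "screen": 0.05,
    "elephant": 1e-7,   # unrelated tokens get near-zero probability
    # ...tens of thousands of other tokens share the remaining mass
}
```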

Temperature and Sampling

Which token the model actually selects from the probability distribution is influenced by a parameter called "temperature":

Low temperature (near 0): The model almost always selects the highest-probability token, producing more deterministic and focused outputs.

High temperature (near 1 or above): The model samples more randomly across probable tokens, producing more varied and creative outputs.

This explains why asking the same question multiple times can yield different answers. Even small variations in probability sampling lead to different token selections, which cascade into different complete responses.
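
To make temperature concrete, here is a minimal sketch of temperature-scaled sampling over a next-token distribution like the one above. Real decoders work on raw logits and usually combine this with top-k or nucleus (top-p) sampling, which this example omits; it assumes all probabilities are positive:

```python
import math
import random

def sample_with_temperature(token_probs, temperature=0.7):
    """Rescale a next-token distribution by temperature, then sample one token.

    Low temperature sharpens the distribution (near-greedy, more repeatable);
    high temperature flattens it (more varied output).
    """
    tokens = list(token_probs)
    # Scale in log-space, mirroring how decoders divide logits by temperature.
    scaled = [math.log(token_probs[t]) / temperature for t in tokens]
    max_scaled = max(scaled)
    weights = [math.exp(s - max_scaled) for s in scaled]   # softmax numerator
    total = sum(weights)
    return random.choices(tokens, weights=[w / total for w in weights], k=1)[0]

# Calling this repeatedly on the same distribution can return different tokens,
# which is one source of run-to-run variation in complete responses.
```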

Why This Matters for Testing

Understanding token-by-token generation helps testers recognize:

Non-determinism is fundamental: Variation in outputs isn't a bug; it's how the technology works. Expecting identical responses to identical prompts misunderstands the system.

Early errors compound: If the model makes a poor word choice early in a response, subsequent generation builds on that mistake. This is why AI sometimes confidently pursues incorrect paths.

Length affects quality: Very long responses may drift from the original intent because the model must maintain coherence across many sequential decisions.

Capabilities of Generative AI

Knowing what generative AI can do well helps you apply it effectively to testing tasks.

Strong Capabilities

Pattern-based generation: LLMs excel at producing content that follows recognizable patterns. Test cases, automation scripts, documentation, and other structured content follow patterns that AI can learn and replicate.

Language understanding: Models understand prompts written in natural language, allowing you to describe what you want without rigid syntax or commands.

Format flexibility: You can request output in various formats, including bullet points, tables, code blocks, or specific documentation templates.

Explanation and elaboration: AI can expand on concepts, provide examples, or explain technical topics at different complexity levels.

Translation and transformation: Converting content between languages, formats, or styles is a strength, whether translating test cases to different languages or converting manual test steps to automation code.

Summarization: Condensing long content into concise summaries, useful for test reports, defect analysis, or requirement reviews.

Brainstorming and ideation: Generating multiple alternatives or ideas quickly, helpful when exploring test scenarios or approaches.

Testing-Specific Applications

Based on these capabilities, generative AI supports testing activities including:

  • Generating test case ideas from requirements or user stories
  • Creating test data sets with specified characteristics
  • Writing automation script templates or complete scripts
  • Producing defect report content from brief descriptions
  • Generating test documentation and reports
  • Explaining code behavior for understanding test subjects
  • Suggesting edge cases and boundary conditions
  • Converting test cases between formats or languages
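
As a small illustration of the first item in this list, a prompt for test case ideas might be assembled from a user story like this; the story text, wording, and requested structure are invented for the example:

```python
user_story = """As a registered user, I want to reset my password via an
emailed link so that I can regain access to my account."""

prompt = f"""You are assisting a software tester.
Generate 8 test case ideas for the user story below.
For each idea provide: a short title, preconditions, steps, and expected result.
Include at least two negative scenarios and two boundary scenarios.

User story:
{user_story}
"""

# Paste `prompt` into a chat assistant or send it through an API; review every
# generated test case before adding it to your test suite.
print(prompt)
```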

Exam Tip: Questions about AI capabilities often present scenarios where you must evaluate whether a proposed use aligns with what generative AI can actually do well. Strong pattern-based tasks are good candidates; tasks requiring real-time data or guaranteed accuracy are not.

Fundamental Constraints and Limitations

Understanding limitations is arguably more important than understanding capabilities. Many problems with AI-assisted testing stem from applying AI beyond its constraints.

No True Understanding or Reasoning

Despite impressive outputs, LLMs don't "understand" text the way humans do. They process patterns without comprehension of meaning:

No semantic understanding: The model doesn't know what a "login form" actually is or does. It knows what text patterns commonly surround mentions of login forms.

Limited reasoning: While models can appear to reason, they're applying learned patterns rather than logical deduction. Complex multi-step reasoning often fails.

No common sense: Models lack the common sense knowledge humans take for granted. They may miss obvious implications or make suggestions that anyone with real-world experience would recognize as impractical.

No Access to External Information

LLMs generate responses based solely on their training data and the current conversation context:

No internet access: Unless specifically integrated with search capabilities, models can't look things up online.

No access to your systems: The model doesn't know your codebase, requirements, test environment, or organizational context unless you provide it.

No memory across sessions: Each conversation starts fresh. The model doesn't remember previous interactions with you or learn from them.

Knowledge Cutoff

Training data has a cutoff date, meaning:

Outdated information: The model may provide incorrect information about recent developments, current library versions, or updated best practices.

Missing recent context: Events, products, or changes after the cutoff date aren't reflected in responses.

Version mismatches: Technical advice may reference old tool versions or deprecated approaches.

Inability to Execute or Verify

LLMs generate text but can't execute code, access files, or verify information:

No execution: Generated code isn't tested. It may have syntax errors, logic bugs, or security vulnerabilities.

No verification: The model can't check whether its assertions are accurate. It generates what seems probable, not what's confirmed true.

No interaction with systems: The model can't actually run tests, access databases, or interact with applications.
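
Because the model cannot run what it writes, it helps to put even a basic local check between generated code and your test suite. A minimal sketch, which only verifies that AI-generated Python parses; it does not catch logic bugs, flaky locators, or security issues:

```python
def syntax_check(generated_code: str, filename: str = "ai_generated.py") -> bool:
    """Compile (but do not run) AI-generated Python to catch syntax errors early.

    Passing this check only proves the code parses; correctness still needs
    human review and real test execution.
    """
    try:
        compile(generated_code, filename, "exec")
        return True
    except SyntaxError as err:
        print(f"Generated code failed to parse: {err}")
        return False

# Example with an obviously broken snippet a model might produce:
snippet = "def test_login(:\n    assert True"
syntax_check(snippet)  # prints the SyntaxError and returns False
```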

Non-Deterministic Outputs

The probabilistic nature of LLMs has significant implications for testing applications.

Why Outputs Vary

Even with identical prompts, outputs differ because:

Sampling randomness: Token selection involves probabilistic sampling, introducing variation even with identical probability distributions.

Temperature effects: Higher temperature settings amplify variation by increasing sampling randomness.

Model state: Subtle variations in model state or infrastructure can influence outputs.

Implications for Testing

Reproducibility challenges: You can't expect identical test cases from repeated prompts. This affects test documentation and audit trails.

Validation necessity: Each AI-generated artifact needs independent validation since you can't assume consistency with previous outputs.

Version control importance: Save AI-generated artifacts rather than assuming you can regenerate them identically.

Prompt engineering value: Well-crafted prompts reduce variation by constraining the space of probable outputs, but can't eliminate it entirely.

Working With Non-Determinism

Effective strategies include:

Treat outputs as drafts: View AI-generated content as starting points requiring human review and refinement.

Generate multiple versions: Ask for several alternatives and select the best, rather than accepting the first output.

Use lower temperatures: When consistency matters, request more deterministic outputs through tool settings if available.

Document prompts and outputs: Maintain records of what prompts produced what outputs for traceability.
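
A hedged sketch of these strategies using the OpenAI Python SDK; the model name is an assumption, and other providers expose similar temperature and multi-candidate options:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = "Generate 5 boundary-value test ideas for a field that accepts 1-100."

response = client.chat.completions.create(
    model="gpt-4o-mini",           # assumption: any chat-capable model works here
    messages=[{"role": "user", "content": prompt}],
    temperature=0.2,               # lower temperature -> more consistent output
    n=3,                           # request several candidates to compare
)

# Save every candidate; regenerating later will not reproduce them exactly.
for i, choice in enumerate(response.choices, start=1):
    with open(f"candidate_{i}.md", "w", encoding="utf-8") as f:
        f.write(choice.message.content)
```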

Context Windows and Memory

Every LLM has limited capacity for handling input, which affects how you use these tools.

What is a Context Window?

The context window is the maximum amount of text the model can consider at once, measured in tokens. This includes both your prompt and the model's response.

Context window sizes vary by model:

  • Smaller models: 4,000-8,000 tokens
  • Medium models: 16,000-32,000 tokens
  • Large context models: 100,000+ tokens

A rough estimate: 1,000 tokens equals approximately 750 words.
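
When the rough estimate isn't enough, a tokenizer library can count exactly. A minimal sketch using the open-source tiktoken package with the cl100k_base encoding used by several OpenAI models; other model families ship their own tokenizers:

```python
import tiktoken

def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    """Count how many tokens a piece of text becomes under a given encoding."""
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(text))

requirement = "The system shall lock an account after five failed login attempts."
print(count_tokens(requirement))  # a short sentence is typically 10-20 tokens
```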

Context Window Limitations

Prompt length limits: Very long prompts may be truncated or cause errors.

Response truncation: Long responses may cut off when hitting context limits.

Conversation length: In multi-turn conversations, older messages eventually drop from context.

No persistent memory: Information outside the current context window isn't accessible to the model.

Working Within Context Limits

Prioritize relevant information: Include the most important context near the beginning of prompts.

Summarize when needed: For long conversations, periodically summarize previous discussion rather than relying on full history.

Break large tasks: Divide complex tasks into smaller pieces that fit within context limits.

Provide focused context: Include relevant information but avoid padding prompts with unnecessary content.
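
A minimal sketch of "break large tasks" in practice: splitting a long requirements document on paragraph boundaries so each chunk stays under a token budget, using the rough 750-words-per-1,000-tokens estimate from above (swap in a real tokenizer for precision):

```python
def chunk_by_token_budget(text: str, max_tokens: int = 3000) -> list[str]:
    """Split text on paragraph boundaries so each chunk fits a token budget.

    Token counts use the rough ~750 words per 1,000 tokens heuristic; a single
    oversized paragraph still becomes its own (oversized) chunk.
    """
    def est_tokens(s: str) -> int:
        # 1,000 tokens ~= 750 words  =>  tokens ~= words / 0.75
        return int(len(s.split()) / 0.75) + 1

    chunks, current = [], []
    for paragraph in text.split("\n\n"):
        if current and est_tokens("\n\n".join(current + [paragraph])) > max_tokens:
            chunks.append("\n\n".join(current))   # flush the full chunk
            current = [paragraph]                 # start a new one
        else:
            current.append(paragraph)
    if current:
        chunks.append("\n\n".join(current))
    return chunks

# Each chunk can then go into its own prompt, with a short summary of the
# chunks already processed included when continuity matters.
```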

Exam Tip: Questions about context windows often test whether you understand that models have limited memory and that information outside the context window is inaccessible regardless of how relevant it might be.

Training Data and Knowledge Cutoffs

Understanding what data models learned from helps you recognize potential issues.

Training Data Sources

LLMs are typically trained on:

  • Web pages and articles
  • Books and publications
  • Code repositories
  • Academic papers
  • Documentation and manuals
  • Social media and forums
  • News articles

The specific sources vary by model, and exact training data composition is often proprietary.

Quality and Representation Issues

Training data affects outputs in important ways:

Internet overrepresentation: Web content is heavily represented, meaning internet writing styles and common web content patterns influence outputs.

Code repository biases: Code generation reflects patterns from public repositories, which may include outdated practices, bugs, or security vulnerabilities.

Language and cultural biases: Training data overrepresents certain languages and cultural perspectives, potentially leading to biased outputs.

Temporal biases: Older content may be overrepresented, causing outdated information to appear more probable.

Knowledge Cutoff Dates

Each model has a knowledge cutoff date after which training data wasn't collected:

Practical implications:

  • Questions about recent events may receive incorrect answers
  • Technical advice may reference outdated tool versions
  • Current best practices may not be reflected

Mitigation strategies:

  • Verify technical information against current documentation
  • Specify versions and dates when relevant to your needs
  • Don't rely on AI for current information without verification

Key Terminology for the Exam

The CT-GenAI exam tests specific terminology. Master these definitions:

Generative AI: AI systems that create new content rather than analyzing existing content.

Large Language Model (LLM): A neural network with billions of parameters trained on massive text datasets to predict and generate text.

Token: A unit of text processed by the model, typically a word or word fragment.

Prompt: The input text provided to an AI system to guide its response generation.

Context window: The maximum number of tokens an LLM can process at once, including both input and output.

Temperature: A parameter controlling the randomness of token selection during generation.

Hallucination: When an AI generates content that sounds plausible but is factually incorrect.

Fine-tuning: Additional training of a pre-trained model on specific data to improve performance for particular tasks.

Knowledge cutoff: The date after which training data was not collected, limiting the model's awareness of recent information.

Non-deterministic: The characteristic of producing potentially different outputs for the same input.

Inference: The process of generating a response from a trained model, as opposed to training the model.



Frequently Asked Questions

Do I need to understand how neural networks work for the CT-GenAI exam?

No. The syllabus expects a conceptual understanding of how LLMs generate text (tokens, probabilities, context windows), not the mathematics of neural networks. The level of detail covered in this article is sufficient.

What's the difference between a hallucination and a simple error in AI output?

A hallucination is content the model generates confidently that sounds plausible but is factually incorrect, such as a citation or function that doesn't exist. It arises because the model predicts probable text rather than verifying facts, which is why hallucinated content can look far more convincing than an ordinary typo or formatting mistake.

Why does asking the same question twice give different answers?

Because generation is probabilistic. The model samples each token from a probability distribution, and temperature settings add further randomness, so identical prompts can cascade into different complete responses.

What happens when my prompt exceeds the context window?

The input may be truncated or rejected, and in long conversations older messages drop out of context. Anything outside the context window is simply not visible to the model, no matter how relevant it is.

Can LLMs learn from my prompts and improve over time?

No. Each conversation starts fresh, and the model's parameters don't change based on your interactions. Any "memory" within a session comes from the conversation context, not from learning.

How do I know if an LLM's information is current?

You often can't without checking. Every model has a knowledge cutoff date, so verify technical details against current documentation and specify versions and dates in your prompts when they matter.

Why can't LLMs access my company's codebase or documentation?

LLMs generate responses only from their training data and the current conversation. Unless you paste the relevant code or documents into the prompt (or use a tool that does so), the model has no knowledge of your systems.

What's the significance of token limits for testing applications?

Token limits cap how much requirement text, code, or test history you can include in a single prompt and how long responses can be. Large testing tasks need to be broken into smaller pieces, with the most relevant context prioritized.