I’m building a mobile game where AI generates hybrid creatures. Not just one image, but a complete content pipeline: constituent images, hybrid images, creature stats, animation prompts, and looping videos. Each stage feeds into the next. When one model’s output becomes another model’s input, things get interesting (and fragile) fast.

The Five-Stage Pipeline

Here’s what happens when I generate a single puzzle:

  1. Constituent images: Generate standalone portraits of each “thing” (e.g., octopus, helicopter)
  2. Hybrid image: Combine visual features into a single creature
  3. Creature metadata: Analyze the hybrid image to generate the creature’s name, description, and stats
  4. Animation prompt: Analyze the hybrid image to describe motion for this specific creature
  5. Video generation: Create a 5-second looping animation using the hybrid image and prompt

Base cost per puzzle: ~$0.70 (2 constituent images at $0.05 each, 1 hybrid image at $0.05, 1 video at $0.50, plus vision LLM calls). That’s five stages, and it’s just the standard version.

For premium content, there’s an optional epic variant with three additional stages:

  1. Epic enhancement analysis: Analyze the base hybrid with vision AI to suggest dramatic enhancements (fire, glowing effects, particles)
  2. Epic image (image-to-image editing): Apply the suggested enhancements to create an enhanced version
  3. Epic animation: Generate a new animation using the epic image and earlier metadata, with enhanced motion

Epic variants depend on the base hybrid existing first, adding another $0.55 (1 image at $0.05, 1 video at $0.50, plus vision LLM calls) and three more potential failure points.
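
For a sense of how the numbers add up, here’s a minimal cost sketch (the constant names are my own; the dollar figures are the ones above, and vision LLM calls add a few cents on top):

// Per-stage cost estimates in USD; names are illustrative, figures are from above
const COSTS = {
  constituentImage: 0.05, // two per puzzle
  hybridImage: 0.05,
  video: 0.5,
  epicImage: 0.05,
  epicVideo: 0.5,
};

const baseCost = 2 * COSTS.constituentImage + COSTS.hybridImage + COSTS.video; // $0.65 + vision calls ≈ $0.70
const epicCost = COSTS.epicImage + COSTS.epicVideo;                            // $0.55 + vision calls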

Why Cascading Pipelines Are Hard

Each stage depends on the previous one succeeding:

  • Hard dependencies: Can’t generate the hybrid without constituent images. Can’t animate without the hybrid. Can’t analyze what doesn’t exist.
  • Expensive retries: If stage 5 fails, you’ve already spent $0.15 on stages 1-4. Do you retry just stage 5? Start over? Accept partial results?
  • Consistency across stages: The animation prompt must reference what’s actually in the hybrid image. The generated metadata must match the environment. One misalignment breaks the whole chain.
  • Status ambiguity: Is this puzzle “draft” or “images_ready” or “half-broken-retry-later”?

The naive approach is to run all stages sequentially and hope nothing fails. That works until it doesn’t.

Prompting for Consistency

Before solving the pipeline fragility problem, there’s a bigger challenge: making thousands of AI-generated images look like they’re from the same game.

The secret isn’t better models. It’s rigid constraints.

Fixed Framing and Composition

Every hybrid image uses identical framing:

  • Aspect ratio: 9:16 portrait (never changes)
  • Composition: “Full frame composition, creature fills the image, no letterboxing, no black bars”
  • Camera angle: Direct, centered shot of the creature
  • Output format: JPEG for hybrids, PNG for constituents

This consistency makes the collection feel cohesive, even when you’re generating octopus-helicopter hybrids next to cactus-submarine hybrids.

Hardcoded Environment Settings

Instead of letting the AI choose environments, I maintain a dictionary of 12 predefined settings. Each environment (forest, ocean, desert, volcano, etc.) has an exact, unchanging description string.

When generating a puzzle, I randomly assign one environment and use that exact string character-for-character. No variation. No creativity. Same environment = same words every time. This ensures all creatures in the “forest” environment have identical lighting, composition, and background elements.
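
A minimal sketch of what that dictionary looks like (the keys and description strings below are shortened stand-ins, not the real ones):

// Exact, unchanging environment descriptions; the strings below are stand-ins
const ENVIRONMENTS: Record<string, string> = {
  forest: 'deep ancient forest, soft dappled light through the canopy, mossy ground',
  ocean: 'open ocean at midday, clear turquoise water, sun rays filtering down',
  desert: 'vast sand dunes at golden hour, long shadows, heat haze on the horizon',
  // ...and 9 more, one per setting
};

// Assign one environment per puzzle, then reuse its string character-for-character
const keys = Object.keys(ENVIRONMENTS);
const environment = keys[Math.floor(Math.random() * keys.length)];
const environmentPrompt = ENVIRONMENTS[environment];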

Locked Visual Style

Every prompt ends with the exact same style directive. Not similar. Not thematically related. Character-for-character identical.

This locked style string defines the rendering approach, lighting, atmosphere, and artistic reference points. It’s never randomized. Never modified. It’s the visual anchor that makes 45,000 possible creature combinations all feel like they belong in the same universe.
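
I won’t reproduce the real style string, but structurally every hybrid prompt ends the same way, with the composition and style suffixes stored as frozen constants (the style text below is a stand-in):

// Frozen constants, never randomized or modified; the style string is a stand-in
const COMPOSITION = 'Full frame composition, creature fills the image, no letterboxing, no black bars';
const STYLE_SUFFIX = 'stylized 3D game art, soft rim lighting, rich saturated colors, painterly detail';

function buildHybridPrompt(creatureDescription: string, environmentPrompt: string): string {
  // Only the creature description varies; framing, environment, and style are exact strings
  return `${creatureDescription}. ${environmentPrompt}. ${COMPOSITION}. ${STYLE_SUFFIX}`;
}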

Constituent Image Templates

Constituent images use the most rigid prompts. Two hardcoded templates (creature vs object), each with exactly 4 variable slots: thing name + first 3 visual features.

Everything else - framing, camera angle, lighting, style, background - is identical across all 300-600 constituent images. Same words, same order, every time.
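
As a sketch, the creature-side template looks something like this (the wording is a placeholder; the point is one fixed template with exactly four slots, reusing the constants from the previous snippet):

// Creature constituent template: 4 slots (name + first 3 visual features), everything else fixed
function buildCreatureConstituentPrompt(name: string, features: string[]): string {
  const [f1, f2, f3] = features.slice(0, 3);
  return `A single ${name}, showing ${f1}, ${f2}, and ${f3}, centered, direct camera angle, ` +
    `plain neutral background. ${COMPOSITION}. ${STYLE_SUFFIX}`;
}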

Result: A grid of constituent images looks like official game assets, not random AI experiments.

Bounded Randomization for Hybrids

Hybrid prompts need variety (otherwise every hybrid looks the same), but within strict boundaries:

  • Anatomy: Randomly pick from 5 predefined body types
  • Pose: Randomly pick from 5 predefined poses with specific action verbs
  • Feature count: Randomly select 2-3 features from each thing (not all of them, so one parent can visually dominate)
  • Fusion phrase: Randomly pick from a set of 5-7 transition phrases

This creates variety in pose and composition while maintaining consistent visual style and framing.

The key insight: Randomize the creature, not the aesthetics. The anatomy can vary. The art style cannot.
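
Here’s a sketch of that bounded randomization, assuming the options live in plain arrays and that thingA and thingB carry their visual feature lists (the list entries are examples, not the real ones):

// Bounded option lists; the entries are illustrative examples
const BODY_TYPES = ['quadruped', 'serpentine', 'winged biped', 'hovering', 'amphibious'];
const POSES = ['leaping forward', 'coiled to strike', 'gliding low', 'rearing up', 'prowling'];
const FUSION_PHRASES = ['seamlessly fused with', 'morphing into', 'grafted onto', 'interwoven with', 'blended with'];

const pick = <T>(options: T[]): T => options[Math.floor(Math.random() * options.length)];

// Take 2-3 features from each thing, so one parent can visually dominate
const pickFeatures = (features: string[]): string[] =>
  [...features].sort(() => Math.random() - 0.5).slice(0, 2 + Math.floor(Math.random() * 2));

const anatomy = pick(BODY_TYPES);
const pose = pick(POSES);
const fusion = pick(FUSION_PHRASES);
const featuresA = pickFeatures(thingA.visual_features); // thingA/thingB assumed in scope
const featuresB = pickFeatures(thingB.visual_features);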

What Actually Works

Status Tracking with Rollback

Every puzzle has a generation_status field that moves through states:

draft → images_pending → images_ready → video_pending → video_ready → published
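
In TypeScript terms, that’s just a string union stored on each puzzle document:

// Every puzzle document carries exactly one of these states
type GenerationStatus =
  | 'draft'
  | 'images_pending'
  | 'images_ready'
  | 'video_pending'
  | 'video_ready'
  | 'published';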

The key insight: Status changes happen at the START of generation, not the end. If the API call fails, we automatically roll back:

// Update status BEFORE calling the API
await db.collection('puzzles').doc(puzzle_id).update({
  generation_status: 'images_pending',
});

try {
  // Generate hybrid image
  const result = await imageAPI.generate({...});

  // On success, move forward
  await db.collection('puzzles').doc(puzzle_id).update({
    generation_status: 'images_ready',
  });
} catch (error) {
  // On error, roll back so the puzzle doesn't sit in 'images_pending' forever
  const fresh = await db.collection('puzzles').doc(puzzle_id).get();
  if (fresh.data()?.generation_status === 'images_pending') {
    await db.collection('puzzles').doc(puzzle_id).update({
      generation_status: 'draft',
    });
  }
  throw error; // surface the failure to the caller
}

This prevents “stuck” puzzles that are forever marked as “pending” when the API call failed 3 days ago.

Preview Mode for Expensive Stages

Animation generation costs $0.50. What if the AI generates a creature standing perfectly still? Or spinning wildly off-screen?

The solution: Every generation endpoint supports preview_mode:

if (preview_mode) {
  // Store to /temp/ with unique ID
  const fileName = `puzzles/${puzzle_id}/temp/animation_preview_${uuid}.mp4`;
  // Don't update generation status or overwrite current version
} else {
  const fileName = `puzzles/${puzzle_id}/animation.mp4`;
  // Update status to video_ready
}

Preview generations are saved to temp storage and reviewed side-by-side with the current version. Accept or discard before committing. You pay for each attempt, but that’s cheaper than publishing bad content.
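
The accept step is the only part that touches the canonical file and the status field. A minimal sketch, assuming Firestore and Firebase Storage via the Admin SDK (the function name and parameters are my own):

import { getFirestore } from 'firebase-admin/firestore';
import { getStorage } from 'firebase-admin/storage';

// Hypothetical "accept preview" step: promote the temp file, then (and only then) advance status
async function acceptAnimationPreview(puzzle_id: string, previewFileName: string): Promise<void> {
  const finalName = `puzzles/${puzzle_id}/animation.mp4`;

  // Overwrite the current animation with the approved preview
  await getStorage().bucket().file(previewFileName).move(finalName);

  // The puzzle only reaches video_ready after a preview has been accepted
  await getFirestore().collection('puzzles').doc(puzzle_id).update({
    generation_status: 'video_ready',
  });
}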

Explicit Dependency Checks

Before generating a hybrid, check that constituent images exist. Before animating, check that the hybrid exists. Before generating an epic animation, check that the epic image exists:

import { NextResponse } from 'next/server';

// In animation generation endpoint
if (!puzzle.hybrid_image_url) {
  return NextResponse.json({
    error: 'Hybrid image must be generated first'
  }, { status: 400 });
}

// In epic animation endpoint
if (!puzzle.epic_image_url) {
  return NextResponse.json({
    error: 'Epic image must be generated first'
  }, { status: 400 });
}

This seems obvious, but it prevents cascading failures. If constituents are missing, fail fast at stage 2 instead of trying to animate nothing at stage 5. If the epic image doesn’t exist, don’t waste $0.50 trying to animate it.

Auto-Generation with Override

Creature metadata auto-generates after the hybrid image:

// After successful hybrid generation
if (!puzzle.creature_name) {
  try {
    const metadata = await analyzeHybridImage(imageUrl, thingA, thingB, environment);
    await db.collection('puzzles').doc(puzzle_id).update({
      creature_name: metadata.name,
      creature_description: metadata.description,
    });
  } catch (error) {
    // Metadata is non-critical: log it and keep the pipeline moving; retry or edit later
    console.error('Creature metadata generation failed', error);
  }
}

But: If analysis fails, it doesn’t break the pipeline. The hybrid image is already saved. Metadata can be retried or manually edited later. Epic variants work the same way.

Don’t treat all stages as equally critical.

What This Doesn’t Solve

Let’s be honest about the limitations:

  • Cross-stage consistency: The vision model might describe motion that doesn’t match the hybrid image. The video generator might ignore the prompt entirely. There’s no automatic verification that the animation actually looks like the prompt.
  • Cost inflation: Preview mode means you pay for every attempt. If it takes 3 tries to get a good $0.50 animation, that’s $1.50 instead of $0.50. You’re paying for exploration.
  • Latency: Five sequential API calls mean 2-3 minutes per puzzle. You can’t parallelize dependent stages.
  • Prompt drift: The animation prompt is generated by analyzing the hybrid image, which was generated from a randomized template. By stage 5, you’re three layers removed from the original intent.
  • Silent failures: If the vision model returns generic descriptions or the video generator produces static video, the pipeline “succeeds” but the output is garbage. Status tracking doesn’t catch quality issues.

Cascading pipelines are powerful for variety and scale, but they amplify quality control problems.

Lessons Learned

  • Consistency requires rigid constraints: Lock down framing, environments, and visual style. Randomize the subject, not the aesthetics. Use exact strings, not creative variations
  • Generate cheap assets first: Images are $0.05, videos are $0.50. Generate all images first (constituents, hybrids, epics) before touching animations. This lets you iterate on the visual style cheaply. Videos can be generated in parallel later once images are finalized
  • Status tracking is non-negotiable: Without explicit states and rollback logic, you’ll have dozens of “pending” puzzles that failed silently
  • Preview mode costs money but saves reputation: Spending 2x on regenerations is worth it to avoid publishing bad content
  • Make dependencies explicit: Check prerequisites at the start of each stage. Fail fast if something is missing
  • Not all stages are equal: Core content (images, videos) must succeed. Metadata can fail gracefully and be retried later
  • Automate the happy path, manual the edge cases: Auto-generation works 80% of the time. The admin UI lets me regenerate or manually override the other 20%
  • Track costs honestly: Preview generations still increment the generation_cost field (see the sketch after this list). It hurts to see $2.50+ spent on a single puzzle after multiple retries, but that’s reality
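
That cost tracking is a one-liner with Firestore’s increment operator (a sketch; the 0.5 here is the per-video figure, and each stage would use its own):

import { FieldValue } from 'firebase-admin/firestore';

// Every attempt, preview or final, adds its cost to the puzzle's running total
await db.collection('puzzles').doc(puzzle_id).update({
  generation_cost: FieldValue.increment(0.5), // per-video cost; other stages use their own figure
});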

Try It Yourself

If you’re building multi-stage AI pipelines:

  1. Add status tracking to your data model (draft, pending, ready, failed)
  2. Implement rollback logic in your error handlers
  3. Build a preview/confirm workflow for expensive stages
  4. Make each stage independently retryable
  5. Track costs per-item, not just aggregate

About the code examples

The code examples in this post are from Hybriddle, an AI-generated creature collection game. The architecture is TypeScript with Firebase, but the patterns apply to any stack.

Cascading AI pipelines are inevitable as models get cheaper and more specialized. The trick is making each stage resilient enough to survive the others failing.