How I Build Projects

A 10-phase methodology for building applied AI systems - from research through production deployment. Distilled from shipping MaplePulse, Apprentice, and SAI, and generalized for any domain.

Phase 1: Study a Reference Project

Before writing any code, find and deeply study a reference project.

  1. Identify an existing open-source project that solves a similar problem
  2. Read every file - understand the data pipeline, prompting strategy, architecture decisions
  3. Create a Learning.md - a comprehensive architectural breakdown, file by file
  4. Identify what needs to change for your domain (don't port blindly - adapt)
  5. Note attribution requirements (check LICENSE files)
Output: Learning.md in docs/ with full architectural analysis

Phase 2: Ground Everything in Real Data

Ground everything in real data, never random values.

  1. Identify authoritative data sources (government census, official statistics, industry databases)
  2. Research and structure probability weights / lookup tables from real sources
  3. Build the data layer first - this is the foundation everything else depends on
  4. Validate data integrity: check for nonsensical combinations (e.g., entry-level job with executive salary)
  5. If using time-sensitive data, add inflation/adjustment factors to bring values to current year
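
The steps above can be sketched as a small data module. Everything here - occupation names, weights, salary ranges, the inflation factor - is an illustrative placeholder, not real census data; the point is the shape: source-attribution comments, an adjustment factor, and an integrity check.

```python
# Hypothetical data module sketch. All values below are placeholders.

# Source: (placeholder) national labour-force survey, 2021 release.
OCCUPATION_WEIGHTS = {
    "retail_salesperson": 0.062,
    "software_developer": 0.018,
    "registered_nurse": 0.024,
}

# Source: (placeholder) official CPI series; brings 2021 dollars to 2025.
INFLATION_2021_TO_2025 = 1.14

# (seniority, occupation) -> (low, high) salary in 2021 dollars
SALARY_RANGES_2021 = {
    ("entry", "software_developer"): (55_000, 75_000),
    ("executive", "software_developer"): (150_000, 250_000),
}

def adjusted_salary_range(seniority: str, occupation: str) -> tuple[int, int]:
    """Bring time-sensitive values to the current year."""
    low, high = SALARY_RANGES_2021[(seniority, occupation)]
    return round(low * INFLATION_2021_TO_2025), round(high * INFLATION_2021_TO_2025)

def validate_record(seniority: str, salary: int) -> list[str]:
    """Flag nonsensical combinations instead of letting them into the dataset."""
    issues = []
    if seniority == "entry" and salary > 140_000:
        issues.append("entry-level job with executive salary")
    return issues
```
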
Output: Data module with source attribution comments

Phase 3: Prototype in a Notebook

Always start in Jupyter. Notebooks let you run each step independently, inspect intermediate state, and iterate without restarting servers.

  1. Create experiments/ directory for notebooks
  2. Build the core pipeline end-to-end in a notebook first
  3. Run each node independently, inspect outputs at every stage
  4. Use notebooks to discover that prompt quality is the bottleneck (it usually is)
  5. Add observability/tracing integration from day one - not after scaling
Output: Working notebook in experiments/ that proves the pipeline works

Phase 4: Prompt Engineering

Design prompts separately from code.

  1. Use a playground environment (OpenAI Playground, etc.) for prompt iteration
  2. Test against edge cases: vague inputs, culturally sensitive topics, deliberately bad content
  3. Identify failure modes: sycophancy, hallucination, generic outputs
  4. Iterate until prompts handle edge cases
  5. Hand refined prompts to Claude Code for integration

Anti-Sycophancy Checklist

  • Use narrow scales with anchored descriptions (1-5, not 1-10)
  • Avoid aspirational labels (“perfect” → use “natural”)
  • Replace booleans with 3+ level enums (true/false → irrelevant/somewhat/directly_relevant)
  • Add explicit calibration instructions: “Score independently. Do not default to positive scores.”
  • Consider multi-model rotation to diversify biases
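
The checklist items translate directly into an output schema. A minimal sketch (field names and anchor wording are hypothetical): a 3-level enum instead of a boolean, a narrow 1-5 scale with an anchored description per score, and an explicit calibration instruction.

```python
from enum import Enum

class Relevance(str, Enum):          # 3-level enum instead of true/false
    IRRELEVANT = "irrelevant"
    SOMEWHAT_RELEVANT = "somewhat_relevant"
    DIRECTLY_RELEVANT = "directly_relevant"

# Narrow 1-5 scale with an anchored description per score - note "natural",
# never an aspirational label like "perfect".
TONE_ANCHORS = {
    1: "robotic - reads like a template",
    2: "stiff - grammatical but unnatural",
    3: "acceptable - minor awkward phrasing",
    4: "natural - reads like a person wrote it",
    5: "natural and distinctive in voice",
}

CALIBRATION = (
    "Score each dimension independently. "
    "Do not default to positive scores; a 3 is a normal outcome."
)

def validate_score(score: int) -> int:
    if score not in TONE_ANCHORS:
        raise ValueError(f"score must be 1-5, got {score}")
    return score
```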

Prompt Hardening

Don't just tell the AI what to do - tell it what NOT to do, and what to do when there's nothing to do.

  • Entity-type boundaries: Define clear category fences so the LLM doesn't misclassify entities across schema fields (e.g., a law name landing in a location field, a category ending up as a tag).
  • Exclusion categories: Explicitly list what should NOT be extracted (e.g., boilerplate, definitions, metadata that isn't actionable data).
  • Absence handling: When data is missing from the source, the model must output NOT FOUND instead of fabricating plausible values. Without this, hallucinated dates, names, or IDs silently corrupt the database with realistic-looking but invented data.
  • Over-extraction guards: Set expected output ranges so the model knows when it's being too aggressive (e.g., “this input type typically yields 10-20 items, not 50”).
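
A hardened prompt combining all four guards might look like the sketch below. The domain (regulations), entity types, and yield ranges are invented for illustration; adapt them to your schema.

```python
# Hypothetical hardened extraction prompt; entity types and ranges are placeholders.
HARDENED_EXTRACTION_PROMPT = """\
Extract regulations from the document below.

Entity-type boundaries:
- "jurisdiction" is a place (country, province, city). A law name is NEVER a jurisdiction.
- "category" comes from the fixed list above. Do not invent categories or reuse tags.

Do NOT extract:
- boilerplate, definitions sections, or document metadata
- anything that is not an actionable requirement

Absence handling:
- If a field (date, name, ID) is not present in the source, output exactly NOT FOUND.
- Never infer or fabricate a plausible value.

Over-extraction guard:
- A document of this type typically yields 10-20 items. If you exceed 30,
  re-check that you are not extracting boilerplate.
"""
```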

Prompt Distillation

Use a strong model to teach cheap models - no fine-tuning required.

  1. Run a strong model (e.g., Claude Sonnet) once on the hardest pipeline step
  2. Study its output - extract the structural reasoning patterns it discovered
  3. Encode those patterns as explicit guidance in cheaper models' system prompts
  4. Result: cheap model follows the same playbook at 40-120x cost reduction
  5. Instantly reversible - it's just prompt text, not model weights
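
Mechanically, distillation is just prompt assembly - the guidance text below is an invented example of a pattern you might extract from a strong model's output:

```python
# Sketch of prompt distillation; the guidance wording is illustrative.
DISTILLED_GUIDANCE = """\
When matching entities, reason in this order (pattern taken from a strong model's run):
1. exact name match, 2. known abbreviation, 3. edit-distance typo check,
4. only then propose a new entity - and state why no existing record fits."""

BASE_SYSTEM_PROMPT = "You reconcile extracted entities against existing records."

def build_cheap_model_prompt() -> str:
    # The cheap model follows the strong model's playbook as plain prompt text -
    # no fine-tuning, and instantly reversible by deleting the guidance block.
    return f"{BASE_SYSTEM_PROMPT}\n\n{DISTILLED_GUIDANCE}"
```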
Output: Refined, hardened prompts ready for code integration

Phase 5: Decompose the Pipeline

Decompose monolithic AI calls into transparent, reviewable steps.

This is the most critical architectural decision for any AI-powered application. A single AI call that goes from input to output is a black box. When the output triggers real consequences, you need to know exactly why.

5a. Pipeline Decomposition

Break every AI pipeline into discrete steps where each step produces a reviewable artifact:

| Pattern | Example (Fair Play) | Example (MaplePulse) |
|---|---|---|
| Extract - identify what the input contains | Keyword chips mapped to schema categories | Parse audience brief → structured AudienceSpec |
| Reconcile - match against existing state | Entity matching against DB records | Search existing personas before generating new ones |
| Plan - propose what to do (HUMAN CHECKPOINT) | Table-by-table INSERT plan with confidence scores | Context projection - which persona attributes matter |
| Execute - carry out the approved plan | Generate SQL from the plan | Fan-out reactions from projected personas |

Why decomposition works:

  1. Each step produces a reviewable artifact - no more black boxes
  2. Each step can use a different model (expensive where quality matters, cheap where mechanical)
  3. Errors are localized - if the output is wrong, check the plan; if the plan is wrong, check the extraction
  4. Human checkpoint separates “what did the AI understand?” from “what will it do?”

5b. Entity Reconciliation

A stateless pipeline will silently corrupt shared databases. If your pipeline writes to a database, it MUST know what's already there.

  • Pre-query existing records before LLM processing
  • Pass existing records as context so the LLM can match against them (catches typos, abbreviations, name variations)
  • Run deterministic conflict detection (date overlaps, duplicate entities, overwrites)
  • Human reviews flagged conflicts before execution
  • Matched IDs flow downstream so the plan reuses existing records instead of creating duplicates
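
A minimal reconciliation sketch (the normalization rules here are deliberately simple - real pipelines would add abbreviation maps and fuzzy matching):

```python
def normalize(name: str) -> str:
    # crude normalization: lowercase, strip periods, collapse whitespace
    return " ".join(name.lower().replace(".", "").split())

def reconcile(incoming: list[str], existing: dict[int, str]):
    """Match incoming names to existing record IDs; flag the rest for review."""
    by_name = {normalize(n): rid for rid, n in existing.items()}
    matched, new = {}, []
    for name in incoming:
        rid = by_name.get(normalize(name))
        if rid is not None:
            matched[name] = rid   # matched IDs flow downstream - reuse, don't duplicate
        else:
            new.append(name)      # human reviews proposed new entities before execution
    return matched, new
```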

5c. Per-Step Model Selection

Make model choice a first-class concern, not an afterthought:

| Step Type | Model Strategy | Why |
|---|---|---|
| Classification / extraction | Cheapest (gemini-3-flash, gpt-5-nano) | Mechanical task, quality floor is high |
| Reconciliation / matching | Mid-tier | Needs semantic understanding but not creativity |
| Planning / reasoning | Best available (Claude Sonnet) | The verification gate - quality matters most here |
| Code generation / execution | Cheap | Follows the plan mechanically |
| Multi-perspective reactions | Fan-out across multiple models | Diversity of perspective, not depth |
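
Making model choice first-class can be as simple as a routing table keyed by step (the model IDs below mix the document's test models with placeholder IDs - substitute whatever your router actually exposes):

```python
STEP_MODELS = {
    "extract":   "google/gemini-3-flash",    # mechanical, cheapest
    "reconcile": "mid-tier-model",           # placeholder ID: semantics, not creativity
    "plan":      "anthropic/claude-sonnet",  # placeholder ID: the verification gate
    "execute":   "openai/gpt-5-nano",        # follows the plan mechanically
}

def pick_model(step: str) -> str:
    try:
        return STEP_MODELS[step]
    except KeyError:
        raise ValueError(f"no model configured for step {step!r}") from None
```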

5d. Prompt Caching

For pipelines that run repeatedly with the same system prompts / schema context:

  • Cache static context (system prompts, schema definitions, few-shot examples) across runs
  • Only variable content (the actual input) changes per call
  • OpenRouter supports this natively; ~40% cost reduction on repeated runs
  • The more the pipeline runs, the cheaper it gets
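
One way to lay out cache-friendly messages is an Anthropic-style `cache_control` breakpoint, which OpenRouter forwards for supported models - verify the exact field names against your provider's current docs before relying on this:

```python
STATIC_CONTEXT = "System prompt + schema definitions + few-shot examples..."

def build_messages(user_input: str) -> list[dict]:
    return [
        {
            "role": "system",
            "content": [{
                "type": "text",
                "text": STATIC_CONTEXT,                  # identical every run -> cacheable
                "cache_control": {"type": "ephemeral"},  # cache breakpoint
            }],
        },
        # only this message changes per call, so repeated runs reuse the cached prefix
        {"role": "user", "content": user_input},
    ]
```
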
Output: Decomposed pipeline with reviewable artifacts at each step

Phase 6: Fan Out to Full-Stack

Once the notebook validates the pipeline, deliberately expand to a proper application. This is not incremental - it's a planned architectural expansion.

6a. Architecture Decisions

Make and document these choices before writing app code:

| Decision | My Defaults |
|---|---|
| Backend | FastAPI (async-native, SSE, Pydantic) |
| Frontend | Next.js + React + TypeScript + Tailwind |
| Streaming | Server-Sent Events (simpler than WebSockets for unidirectional) |
| Containerization | Docker Compose (services mounted to D: drive) |
| Package manager | uv (never pip) |
| DB (simple projects) | SQLite + FTS5 (migrate later if needed) |
| LLM routing | OpenRouter (single API, cost tracking, prompt caching) |
| State machine | LangGraph (for multi-step AI workflows with branching) |
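
The SSE choice above comes down to a very simple wire format. A sketch of the generator side (in FastAPI you would wrap a generator like this in `StreamingResponse(..., media_type="text/event-stream")`; the `[DONE]` sentinel is a common convention, not part of the SSE spec):

```python
import json
from typing import Iterable, Iterator

def sse_events(chunks: Iterable[dict]) -> Iterator[str]:
    for chunk in chunks:
        # one "data:" line per event, terminated by a blank line
        yield f"data: {json.dumps(chunk)}\n\n"
    yield "data: [DONE]\n\n"   # sentinel so the client knows the stream ended
```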

6b. Frontend-First with Mock Data

  1. Build the complete UI with mock data before connecting to backend
  2. Include deliberately negative/edge-case mock data to test all UI states
  3. Validate the full user workflow experience end-to-end with mocks
  4. Only then connect to the real backend

6c. Human-in-the-Loop Splits

Split long-running pipelines into phases with human checkpoints:

  • User reviews and approves intermediate results before proceeding
  • User can exclude bad results before they pollute downstream steps
  • This builds trust - the system shows its work

6d. Sandbox / Production Isolation

AI-generated actions must never touch production without human approval.

  • Playground mode (SQLite): full execution, reset/seed, sample data for testing
  • Production mode (PostgreSQL): review only, execution disabled, no reset capability
  • Auto-detect from DATABASE_URL - no manual toggle needed
  • This is a core architectural decision, not a convenience feature
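
The auto-detect rule is a one-liner - a sketch, assuming the URL scheme is the only signal you need:

```python
import os

def detect_mode(database_url: str) -> str:
    # SQLite -> playground (full execution, reset/seed);
    # anything else (e.g. PostgreSQL) -> production (review only, no reset)
    return "playground" if database_url.startswith("sqlite") else "production"

def execution_allowed() -> bool:
    url = os.environ.get("DATABASE_URL", "sqlite:///playground.db")
    return detect_mode(url) == "playground"
```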

6e. Dual-Database Strategy (when needed)

Use two databases when analytics must survive resets:

  • Main DB (DATABASE_URL): Domain data - SQLite in dev, PostgreSQL in prod
  • Analytics DB (ANALYTICS_DATABASE_URL): Always PostgreSQL - feedback ratings, token usage, cost logs persist even when main DB is reset in playground mode
Output: Working full-stack app with Docker Compose

Phase 7: Observability

Trace every LLM call and surface cost visibility where the user already works.

  1. Trace every LLM call - model, tokens, cost, latency
  2. Group related calls by session/run ID
  3. Name spans for each pipeline phase
  4. Use visual trace tools to find bottlenecks (Langfuse, Arize, or in-app UI)
  5. Optimize model selection: use cheap models (gemini-3-flash, gpt-5-nano) for agent/tool-calling tasks, reserve expensive models for quality-critical steps
  6. Surface observability where the user already works - token badges, cost per step, latency inline in the UI, not buried in a separate dashboard
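
The core of steps 1-3 is a per-call trace record grouped by run ID. A hand-rolled sketch (field names are illustrative; tools like Langfuse provide this out of the box):

```python
import time
from collections import defaultdict

TRACES: dict[str, list[dict]] = defaultdict(list)

def traced_call(run_id: str, span: str, model: str, fn, prompt: str) -> str:
    """fn is a stand-in for the real LLM client; it returns (output, tokens, cost)."""
    start = time.perf_counter()
    output, tokens, cost = fn(prompt)
    TRACES[run_id].append({
        "span": span, "model": model, "tokens": tokens,
        "cost_usd": cost, "latency_s": time.perf_counter() - start,
    })
    return output

def run_cost(run_id: str) -> float:
    # the number you surface inline in the UI, not in a separate dashboard
    return sum(rec["cost_usd"] for rec in TRACES[run_id])
```
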
Output: Integrated tracing with cost visibility

Phase 8: Evaluation

Build evals before declaring done. Don't rely on vibes.

8a. Evaluation Framework

  1. Combine automated checks (fire every run) with human-in-the-loop judgments
  2. Embed eval UI into the existing workflow - don't make testers fill out separate forms
  3. Define pass criteria up-front for each eval
  4. Store eval data linked to traces (e.g., via trace_id)
  5. Automated evals fire silently; human evals are lightweight (thumbs up/down or Good/Partial/Bad)
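
A minimal sketch of both eval kinds stored against a trace ID (the pass criterion and rating levels are examples; define your own up-front):

```python
EVALS: list[dict] = []

def automated_eval(trace_id: str, items: list, expected: range = range(1, 31)) -> bool:
    # fires silently on every run; pass criterion defined up-front
    passed = len(items) in expected
    EVALS.append({"trace_id": trace_id, "kind": "auto", "passed": passed})
    return passed

def human_eval(trace_id: str, rating: str) -> None:
    # lightweight, in-workflow rating - no separate forms
    if rating not in {"good", "partial", "bad"}:
        raise ValueError(f"unknown rating {rating!r}")
    EVALS.append({"trace_id": trace_id, "kind": "human", "rating": rating})
```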

8b. Feedback-Driven Progressive Autonomy (Trust Lifecycle)

Human ratings aren't just for measurement - they're the mechanism for earning autonomy:

| Phase | Behavior | Trigger |
|---|---|---|
| Hands-On | Human reviews every output, rates every step | Default starting state |
| Building Trust | High-confidence results auto-approve; human reviews only flagged items | Model consistently rates >90% Good on a step |
| Autonomous | Routine work flows through automatically; humans handle exceptions only | Sustained high performance + cost viability |

The feedback loop closes the circle: Human ratings → model comparison → confidence thresholds → progressive autonomy. Without the ratings infrastructure, there's no path from Phase 1 to Phase 3.
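
The confidence-threshold step of that loop can be sketched as a gate over accumulated ratings (the sample-size and threshold values are illustrative):

```python
def auto_approve(step_ratings: list[str], min_samples: int = 20,
                 good_threshold: float = 0.9) -> bool:
    """Auto-approve a step only after sustained >90% Good ratings."""
    if len(step_ratings) < min_samples:
        return False                      # Hands-On phase: not enough evidence yet
    good = step_ratings.count("good") / len(step_ratings)
    return good > good_threshold          # Building Trust -> Autonomous
```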

Output: Eval framework with defined pass criteria and a path to progressive autonomy

Phase 9: Production Deployment

Docker Compose, Nginx, SSL, and version bumping.

  1. Docker Compose with named volumes for persistence
  2. Multi-stage Docker builds (slim images, health checks, restart policies)
  3. Nginx reverse proxy with Let's Encrypt SSL
  4. Secrets in .env on server only - never in git, never in Docker image
  5. Bump APP_VERSION before every deploy
  6. Rebuild containers after backend code changes (no asking)
  7. Lifespan hooks with daemon threads for slow initialization (DB init doesn't block health checks)
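
Item 7 boils down to the pattern below - slow initialization runs in a daemon thread so the health endpoint answers immediately. Shown here as plain Python (in FastAPI you would call `on_startup()` from a lifespan hook and expose `health()` as a route):

```python
import threading
import time

READY = threading.Event()

def slow_db_init() -> None:
    time.sleep(0.1)          # stand-in for migrations / seeding
    READY.set()

def on_startup() -> None:
    # daemon=True: the thread never blocks shutdown or the health endpoint
    threading.Thread(target=slow_db_init, daemon=True).start()

def health() -> dict:
    # health check responds immediately; readiness is reported separately
    return {"status": "ok", "db_ready": READY.is_set()}
```
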
Output: Deployed application

Phase 10: Presentation Scripts

Every project gets two scripts for video presentation - one business, one technical. These are written after the app works but before recording. They live in docs/.

10a. Business Script (docs/script_business.md)

Audience: Non-technical - marketers, founders, hiring managers, potential users.
Length: 5-7 minutes.
Tone: Conversational, honest, minimal jargon. Do NOT oversell.

| Section | Duration | Purpose |
|---|---|---|
| Hook | 15-30s | One sentence: what does this solve? Lead with a concrete scenario, not a feature list. |
| The Problem | 45-60s | The pain point everyone recognizes. Use specific examples, not abstractions. |
| What It Does | 60-90s | Step-by-step walkthrough of the core workflow. Screen recording with narration. |
| Live Demo | 2-3 min | Hero demo with real data. Show every screen. Let reactions/results stream - don't rush. Add 1-2 quick alternative demos (different modes/use cases). |
| Honest Limitations | 30-45s | What this is NOT. Position as complement, not replacement. Use an analogy ("spell-checker for messaging," "first filter, not the final answer"). |
| Why The Data Matters | 30-60s | What grounds this beyond generic AI. Real data sources, quality controls, design decisions that prevent common AI failure modes. |
| Close | 15-30s | Recap + call to action. Link in description. |

Rules:

  • Do NOT oversell - be honest about what it is and isn't
  • Keep technical details minimal - save for the technical video
  • Credit reference projects / inspiration
  • Include [PLACEHOLDER] tags for screen recordings and examples to fill during recording
  • Include [SCREEN: ...] cues for what to show on screen at each moment

10b. Technical Script (docs/script_technical.md)

Audience: Hiring managers evaluating decision-making, system thinking, and AI-augmented development. Also: engineers who want to understand the architecture.
Length: 15-20 minutes.
Tone: Honest narrative. Claude Code wrote most of the code - what matters is the sequence of decisions, problems spotted, architecture shaped, and interventions that turned AI output into a working product.
Style: Storytelling through the build journey - each section is a problem → decision → result.

| Section | Purpose |
|---|---|
| The Spark | How you found the reference project, what you learned, what needed to change for your domain |
| Data Grounding | What data you researched, why it matters, what problems you caught (e.g., nonsensical data combinations) |
| The Prototype | Notebook phase - what the pipeline looked like, what it taught you |
| Prompt Engineering | Meta prompting workflow, playground iterations, failure modes discovered |
| Fan-Out to Full-Stack | Architecture decisions table, frontend-first approach, why each choice |
| The [Domain] Problem | The biggest quality problem you hit and how you solved it (sycophancy, hallucination, duplication, etc.) |
| The [Key Feature] | The most architecturally interesting feature and the decisions behind it |
| Observability | What you instrumented, what it revealed, what you optimized |
| Production Deployment | Deployment decisions, human-in-the-loop workflow, infrastructure |
| Evaluation Framework | How you measure quality - automated + human evals, pass criteria |
| Architecture Evolution | Full picture diagram - all layers, all components |
| What I'd Do Differently | Honest retrospective - shows self-awareness |
| My Role vs. AI's Role | Two-column breakdown: your decisions/interventions vs. Claude Code's implementation. End with the pattern statement. |

Rules:

  • Each section: show the problem, show your decision, show the result, move on
  • Reference [DIAGRAM: ...] tags for architecture diagrams (build a separate diagrams_technical.html)
  • Include a git commit table showing the evolution commit by commit
  • Include technical stats (LOC, endpoints, components, LLM calls per run, dev time)
  • Don't apologize for using AI tools - the value is in the decisions, not the keystrokes
  • The “My Role vs. AI's Role” section is critical - it demonstrates what you actually bring to the table

10c. Technical Diagrams Page (docs/diagrams_technical.html)

A self-contained HTML page opened alongside the technical video. Each diagram maps to a script section via [DIAGRAM: ...] cues.

Structure:

  • Sticky TOC at top with color-coded links to each diagram
  • One card per diagram, each with title + script section reference
  • Cards scroll independently - click TOC to jump

Diagram types to include (adapt per project):

| Diagram | What it shows | Visual pattern |
|---|---|---|
| Data/Enrichment Pipeline | How raw data becomes usable entities | Vertical flow: source → transform → enrich → output |
| Pipeline Flow | Multi-step AI pipeline with node names | Vertical chain of pipe-nodes with arrows |
| Project Evolution | How the project grew phase by phase | Horizontal phase cards with dates + parallel workstreams below |
| Before/After Comparison | Quality problem you fixed (e.g., sycophancy, duplication) | Side-by-side red/green cards with tables |
| Key Feature Architecture | The most interesting feature (agent, reconciliation, etc.) | Vertical flow with nested tool boxes |
| Observability Stack | What you instrumented and why | Side-by-side cards (production tracing vs. dev debugging) |
| Human-in-the-Loop Workflow | Multi-phase workflow with checkpoints | Horizontal workflow cards with endpoints, red text on human gates |
| Evaluation Framework | Automated vs. human evals | Two columns (automated / human) + storage architecture |
| System Architecture | Full stack overview | Three columns (frontend / backend / data+infra) |
| My Role vs. AI's Role | Two-column decision/implementation split | Blue (your decisions) + purple (Claude Code's implementation) |

Design rules:

  • Pure HTML + inline CSS - no external dependencies, works offline
  • Clean, professional look: white cards on light gray, rounded corners, subtle shadows
  • Color-coded by section (blue for data, purple for pipeline, red for problems, green for solutions, amber for features)
  • Print-friendly: @media print removes shadows, adds borders
  • Use scroll-margin-top on cards so TOC jumps land cleanly below the sticky nav
  • End with “The Pattern” callout - the same statement from the technical script

10d. Optional: Marketing Script (docs/script_marketing.md)

A read-along script for screen recording the live app. Shorter (5 min), heavily demo-driven, minimal talking. Copy-paste examples provided inline for each demo scenario. Include rapid-fire section showing range of inputs.

Output: Script files in docs/, diagrams HTML page, ready for video recording

Environment Rules

Docker-only installs, uv over pip, testing model defaults.

  • Never pip install on host machine - C: drive space constrained. Always use Docker containers mounted to D: drive
  • Always use uv instead of pip inside containers
  • Testing models: google/gemini-3-flash and openai/gpt-5-nano (cheap, fast)
  • Always rebuild Docker containers after backend code changes without asking
  • Always bump APP_VERSION in backend/main.py before deploying via push.sh

Hard-Won Lessons

8 hard-won lessons from shipping AI systems.

  1. Trust is earned, not declared. You don't flip a switch and call the system reliable - you measure your way there. Hands-on feedback is what makes the autonomous phase possible.
  2. The plan is the product. Extraction and execution are implementation details; the structured plan is what a human can actually read and approve. This is the verification gate.
  3. Prompt distillation > fine-tuning. Run a strong model once and encode its reasoning into cheap models' prompts: 40-120x cost reduction, no training data, instantly reversible.
  4. RAG is not always the answer. For exhaustive single-document extraction, full-text input beats retrieval. RAG adds value with many documents, not single-document pipelines. Test before assuming you need it.
  5. LLMs fill blanks - they don't flag them. When source data is missing, the model fabricates plausible values instead of reporting the gap. Every extraction prompt needs explicit absence handling.
  6. Stateless pipelines corrupt shared databases. If your pipeline doesn't know what's already in the DB, it will silently create duplicates. Entity reconciliation is not optional for production use.
  7. Playground mode is not optional. AI-generated code/queries must never touch production without human approval. The sandbox/production split is a core architectural decision.
  8. Search before generate. Always check existing data before creating new data. This keeps costs low and leverages organic growth of your data pool.

I identify problems, design solutions, choose technologies, craft prompts, and direct architecture. Claude Code turns those decisions into working code - fast and reliably. The skill is knowing what to build and when to intervene.

Decision Flow

Research reference → Ground in real data → Prototype in notebook → Design & harden prompts → Design decomposed pipeline → Fan out to full-stack → Add observability → Build evals & feedback loop → Deploy

Each phase validates the previous one. Don't skip ahead.