gan-style-harness
Multi-agent harness that separates generation from evaluation to create a GAN-inspired adversarial feedback loop, suited to building complete applications and high-quality frontend designs.
npx skills add affaan-m/everything-claude-code --skill gan-style-harness
Before / After Comparison
Before: Manually configure generator and discriminator networks, adjust hyperparameters, monitor the training process, and manually adjust loss weights. One model requires 1 week for tuning.
After: Automated multi-agent adversarial training system. Generator and evaluator are independently optimized, and loss weights are automatically balanced. Achieves convergence and generates high-quality images in 2 days.
GAN-Style Harness Skill
Inspired by Anthropic's Harness Design for Long-Running Application Development (March 24, 2026)
A multi-agent harness that separates generation from evaluation, creating an adversarial feedback loop that drives quality far beyond what a single agent can achieve.
Core Insight
When asked to evaluate their own work, agents are pathological optimists — they praise mediocre output and talk themselves out of legitimate issues. But engineering a separate evaluator to be ruthlessly strict is far more tractable than teaching a generator to self-critique.
This is the same dynamic as GANs (Generative Adversarial Networks): the Generator produces, the Evaluator critiques, and that feedback drives the next iteration.
When to Use
- Building complete applications from a one-line prompt
- Frontend design tasks requiring high visual quality
- Full-stack projects that need working features, not just code
- Any task where "AI slop" aesthetics are unacceptable
- Projects where you want to invest $50-200 for production-quality output
When NOT to Use
- Quick single-file fixes (use standard claude -p)
- Tasks with tight budget constraints (<$10)
- Simple refactoring (use de-sloppify pattern instead)
- Tasks that are already well-specified with tests (use TDD workflow)
Architecture
     ┌─────────────┐
     │   PLANNER   │
     │ (Opus 4.6)  │
     └──────┬──────┘
            │ Product Spec
            │ (features, sprints, design direction)
            ▼
┌────────────────────────┐
│                        │
│   GENERATOR-EVALUATOR  │
│      FEEDBACK LOOP     │
│                        │
│  ┌──────────┐          │
│  │GENERATOR │--build-->│──┐
│  │(Opus 4.6)│          │  │
│  └────▲─────┘          │  │
│       │                │  │ live app
│    feedback            │  │
│       │                │  │
│  ┌────┴─────┐          │  │
│  │EVALUATOR │<-test----│──┘
│  │(Opus 4.6)│          │
│  │+Playwright│         │
│  └──────────┘          │
│                        │
│    5-15 iterations     │
└────────────────────────┘
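The control flow in the diagram can be sketched roughly as follows. This is a minimal Python illustration, not the harness's actual implementation: `run_harness` and the three callables (`plan`, `generate`, `evaluate`) are hypothetical names, and a real setup would have them invoke the Planner, Generator, and Evaluator agents.

```python
from typing import Callable, Tuple


def run_harness(
    brief: str,
    plan: Callable[[str], str],                    # brief -> product spec
    generate: Callable[[str, str], None],          # (spec, feedback) -> builds the app
    evaluate: Callable[[str], Tuple[float, str]],  # spec -> (weighted score, feedback)
    max_iterations: int = 15,
    pass_threshold: float = 7.0,
) -> Tuple[bool, int, float]:
    """Plan once, then alternate generate/evaluate until the score passes
    or the iteration budget runs out. Returns (passed, iterations, last score)."""
    spec = plan(brief)
    feedback = ""  # no feedback exists before the first evaluation
    score = 0.0
    for iteration in range(1, max_iterations + 1):
        generate(spec, feedback)
        score, feedback = evaluate(spec)
        if score >= pass_threshold:
            return True, iteration, score
    return False, max_iterations, score
```

Keeping the evaluator behind its own callable mirrors the separation the harness relies on: the generator only ever sees the evaluator's feedback, never its own assessment of its work.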
The Three Agents
1. Planner Agent
Role: Product manager — expands a brief prompt into a full product specification.
Key behaviors:
- Takes a one-line prompt and produces a 16-feature, multi-sprint specification
- Defines user stories, technical requirements, and visual design direction
- Is deliberately ambitious, because conservative planning leads to underwhelming results
- Produces evaluation criteria that the Evaluator will use later
Model: Opus 4.6 (needs deep reasoning for spec expansion)
2. Generator Agent
Role: Developer — implements features according to the spec.
Key behaviors:
- Works in structured sprints (or continuous mode with newer models)
- Negotiates a "sprint contract" with the Evaluator before writing code
- Uses full-stack tooling: React, FastAPI/Express, databases, CSS
- Manages git for version control between iterations
- Reads Evaluator feedback and incorporates it in the next iteration
Model: Opus 4.6 (needs strong coding capability)
3. Evaluator Agent
Role: QA engineer — tests the live running application, not just code.
Key behaviors:
- Uses Playwright MCP to interact with the live application
- Clicks through features, fills forms, tests API endpoints
- Scores against four criteria (configurable):
  - Design Quality: Does it feel like a coherent whole?
  - Originality: Custom decisions vs. template/AI patterns?
  - Craft: Typography, spacing, animations, micro-interactions?
  - Functionality: Do all features actually work?
- Returns structured feedback with scores and specific issues
- Is engineered to be ruthlessly strict: never praises mediocre work
Model: Opus 4.6 (needs strong judgment + tool use)
Evaluation Criteria
The default four criteria, each scored 1-10:
## Evaluation Rubric
### Design Quality (weight: 0.3)
- 1-3: Generic, template-like, "AI slop" aesthetics
- 4-6: Competent but unremarkable, follows conventions
- 7-8: Distinctive, cohesive visual identity
- 9-10: Could pass for a professional designer's work
### Originality (weight: 0.2)
- 1-3: Default colors, stock layouts, no personality
- 4-6: Some custom choices, mostly standard patterns
- 7-8: Clear creative vision, unique approach
- 9-10: Surprising, delightful, genuinely novel
### Craft (weight: 0.3)
- 1-3: Broken layouts, missing states, no animations
- 4-6: Works but feels rough, inconsistent spacing
- 7-8: Polished, smooth transitions, responsive
- 9-10: Pixel-perfect, delightful micro-interactions
### Functionality (weight: 0.2)
- 1-3: Core features broken or missing
- 4-6: Happy path works, edge cases fail
- 7-8: All features work, good error handling
- 9-10: Bulletproof, handles every edge case
Scoring
- Weighted score = sum of (criterion_score * weight)
- Pass threshold = 7.0 (configurable)
- Max iterations = 15 (configurable; 5-15 is typically sufficient)
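As a worked example of the scoring rule, here is a small Python sketch using the default weights from the rubric. The function names and the dictionary keys (taken from the default criteria names) are illustrative assumptions, not part of the skill itself.

```python
from typing import Dict

# Default weights from the rubric; they sum to 1.0.
DEFAULT_WEIGHTS: Dict[str, float] = {
    "design": 0.3,
    "originality": 0.2,
    "craft": 0.3,
    "functionality": 0.2,
}


def weighted_score(scores: Dict[str, float],
                   weights: Dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """Return sum(criterion_score * weight), rounded to two decimals."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    total = sum(scores[name] * weight for name, weight in weights.items())
    # Rounding avoids float noise such as 6.999999 failing a 7.0 threshold.
    return round(total, 2)


def passes(scores: Dict[str, float], threshold: float = 7.0) -> bool:
    """True when the weighted score meets the pass threshold."""
    return weighted_score(scores) >= threshold


example = {"design": 8, "originality": 6, "craft": 7, "functionality": 9}
print(weighted_score(example))  # 2.4 + 1.2 + 2.1 + 1.8 = 7.5
print(passes(example))          # True with the default threshold of 7.0
```

Note that an app can pass overall while one criterion stays weak (Originality is 6 here); a stricter setup might also require a minimum per-criterion score, which is not something the source specifies.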
Usage
Via Command
# Full three-agent harness
/project:gan-build "Build a project management app with Kanban boards, team collaboration, and dark mode"
# With custom config
/project:gan-build "Build a recipe sharing platform" --max-iterations 10 --pass-threshold 7.5
# Frontend design mode (generator + evaluator only, no planner)
/project:gan-design "Create a landing page for a crypto portfolio tracker"
Via Shell Script
# Basic usage
./scripts/gan-harness.sh "Build a music streaming dashboard"
# With options
GAN_MAX_ITERATIONS=10 \
GAN_PASS_THRESHOLD=7.5 \
GAN_EVAL_CRITERIA="functionality,performance,security" \
./scripts/gan-harness.sh "Build a REST API for task management"
Via Claude Code (Manual)
# Step 1: Plan
claude -p --model opus "You are a Product Planner. Read PLANNER_PROMPT.md. Expand this brief into a full product spec: 'Build a Kanban board app'. Write spec to spec.md"
# Step 2: Generate (iteration 1)
claude -p --model opus "You are a Generator. Read spec.md. Implement Sprint 1. Start the dev server on port 3000."
# Step 3: Evaluate (iteration 1)
claude -p --model opus --allowedTools "Read,Bash,mcp__playwright__*" "You are an Evaluator. Read EVALUATOR_PROMPT.md. Test the live app at http://localhost:3000. Score against the rubric. Write feedback to feedback-001.md"
# Step 4: Generate (iteration 2 — reads feedback)
claude -p --model opus "You are a Generator. Read spec.md and feedback-001.md. Address all issues. Improve the scores."
# Repeat steps 3-4 until pass threshold met
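The repeat step can be made mechanical by deriving each iteration's command line from its number. The sketch below only builds argument lists from the commands shown above; `feedback_path`, `generator_argv`, and `evaluator_argv` are hypothetical helper names, and nothing here launches a process.

```python
from typing import List


def feedback_path(iteration: int) -> str:
    """Zero-padded feedback file name, e.g. feedback-001.md."""
    return f"feedback-{iteration:03d}.md"


def generator_argv(iteration: int, model: str = "opus") -> List[str]:
    """Command for the Generator; from iteration 2 on it reads the prior feedback."""
    if iteration == 1:
        prompt = ("You are a Generator. Read spec.md. Implement Sprint 1. "
                  "Start the dev server on port 3000.")
    else:
        prompt = (f"You are a Generator. Read spec.md and "
                  f"{feedback_path(iteration - 1)}. Address all issues. "
                  "Improve the scores.")
    return ["claude", "-p", "--model", model, prompt]


def evaluator_argv(iteration: int, model: str = "opus", port: int = 3000) -> List[str]:
    """Command for the Evaluator; it writes this iteration's feedback file."""
    prompt = ("You are an Evaluator. Read EVALUATOR_PROMPT.md. "
              f"Test the live app at http://localhost:{port}. "
              "Score against the rubric. "
              f"Write feedback to {feedback_path(iteration)}")
    return ["claude", "-p", "--model", model,
            "--allowedTools", "Read,Bash,mcp__playwright__*", prompt]
```

Each list could then be handed to `subprocess.run(argv, check=True)`; passing a list rather than a shell string sidesteps quoting problems in the long prompts.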
Evolution Across Model Capabilities
The harness should simplify as models improve. Following Anthropic's evolution:
Stage 1 — Weaker Models (Sonnet-class)
- Full sprint decomposition required
- Context resets between sprints (avoid context anxiety)
- 2-agent minimum: Initializer + Coding Agent
- Heavy scaffolding compensates for model limitations
Stage 2 — Capable Models (Opus 4.5-class)
-
Full 3-agent harness: Planner + Generator + Evaluator
-
Sprint contracts before each implementation phase
-
10-sprint decomposition for complex apps
-
Context resets still useful but less critical
Stage 3 — Frontier Models (Opus 4.6-class)
- Simplified harness: single planning pass, continuous generation
- Evaluation reduced to single end-pass (model is smarter)
- No sprint structure needed
- Automatic compaction handles context growth
Key principle: Every harness component encodes an assumption about what the model can't do alone. When models improve, re-test those assumptions. Strip away what's no longer needed.
Configuration
Environment Variables
| Variable | Default | Description |
| --- | --- | --- |
| GAN_MAX_ITERATIONS | 15 | Maximum generator-evaluator cycles |
| GAN_PASS_THRESHOLD | 7.0 | Weighted score to pass (1-10) |
| GAN_PLANNER_MODEL | opus | Model for planning agent |
| GAN_GENERATOR_MODEL | opus | Model for generator agent |
| GAN_EVALUATOR_MODEL | opus | Model for evaluator agent |
| GAN_EVAL_CRITERIA | design,originality,craft,functionality | Comma-separated criteria |
| GAN_DEV_SERVER_PORT | 3000 | Port for the live app |
| GAN_DEV_SERVER_CMD | npm run dev | Command to start dev server |
| GAN_PROJECT_DIR | . | Project working directory |
| GAN_SKIP_PLANNER | false | Skip planner, use spec directly |
| GAN_EVAL_MODE | playwright | playwright, screenshot, or code-only |
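For anyone wrapping the harness in their own tooling, the variables above could be read with their documented defaults along these lines. This is a sketch under assumptions: `HarnessConfig` and `load_config` are hypothetical names, only a subset of the variables is covered, and treating only the string "true" as enabling GAN_SKIP_PLANNER is a guess at how the shell script parses it.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Tuple

VALID_EVAL_MODES = ("playwright", "screenshot", "code-only")


@dataclass(frozen=True)
class HarnessConfig:
    max_iterations: int
    pass_threshold: float
    eval_criteria: Tuple[str, ...]
    dev_server_port: int
    skip_planner: bool
    eval_mode: str


def load_config(env: Mapping[str, str] = os.environ) -> HarnessConfig:
    """Read GAN_* variables, falling back to the documented defaults."""
    mode = env.get("GAN_EVAL_MODE", "playwright")
    if mode not in VALID_EVAL_MODES:
        raise ValueError(f"GAN_EVAL_MODE must be one of {VALID_EVAL_MODES}")
    raw_criteria = env.get("GAN_EVAL_CRITERIA",
                           "design,originality,craft,functionality")
    criteria = tuple(c.strip() for c in raw_criteria.split(",") if c.strip())
    return HarnessConfig(
        max_iterations=int(env.get("GAN_MAX_ITERATIONS", "15")),
        pass_threshold=float(env.get("GAN_PASS_THRESHOLD", "7.0")),
        eval_criteria=criteria,
        dev_server_port=int(env.get("GAN_DEV_SERVER_PORT", "3000")),
        # Assumption: only the literal string "true" (any case) enables this.
        skip_planner=env.get("GAN_SKIP_PLANNER", "false").lower() == "true",
        eval_mode=mode,
    )
```

Accepting any mapping instead of reading `os.environ` directly makes the parser easy to exercise with a plain dictionary.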
Evaluation Modes
| Mode | Tools | Best For |
| --- | --- | --- |
| playwright | Browser MCP + live interaction | Full-stack apps with UI |
| screenshot | Screenshot + visual analysis | Static sites, design-only |
| code-only | Tests + linting + build | APIs, libraries, CLI tools |
Anti-Patterns
Evaluator too lenient — If the evaluator passes everything on iteration 1, your rubric is too generous. Tighten scoring criteria and add explicit penalties for common AI patterns.
Generator ignoring feedback — Ensure feedback is passed as a file, not inline. The generator should read feedback-NNN.md at the start of each iteration.
Infinite loops — Always set GAN_MAX_ITERATIONS. If the generator can't improve past a score plateau after 3 iterations, stop and flag for human review.
Evaluator testing superficially — The evaluator must use Playwright to interact with the live app, not just screenshot it. Click buttons, fill forms, test error states.
Evaluator praising its own fixes — Never let the evaluator suggest fixes and then evaluate those fixes. The evaluator only critiques; the generator fixes.
Context exhaustion — For long sessions, use Claude Agent SDK's automatic compaction or reset context between major phases.
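The stopping rules in the "Infinite loops" item can be expressed as a small check run after every evaluation. This is one possible reading, sketched in Python: `stop_reason` is a hypothetical helper, and "plateau" is interpreted here as the last three scores failing to beat the best score seen before them.

```python
from typing import List, Optional


def stop_reason(scores: List[float],
                max_iterations: int = 15,
                pass_threshold: float = 7.0,
                patience: int = 3) -> Optional[str]:
    """Return why the loop should stop, or None to keep iterating.

    scores holds the weighted score of each completed iteration, oldest first.
    """
    if not scores:
        return None
    if scores[-1] >= pass_threshold:
        return "passed"
    if len(scores) >= max_iterations:
        return "max_iterations"
    if len(scores) > patience:
        best_before = max(scores[:-patience])
        # No score in the last `patience` iterations beat the earlier best.
        if max(scores[-patience:]) <= best_before:
            return "plateau"  # stop and flag for human review
    return None
```

For example, a history of 5.0, 6.0, 6.0, 5.8, 6.0 yields "plateau", while 5.0, 5.5, 6.0, 6.5 returns None because the scores are still climbing.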
Results: What to Expect
Based on Anthropic's published results:
| Metric | Solo Agent | GAN Harness | Improvement |
| --- | --- | --- | --- |
| Time | 20 min | 4-6 hours | 12-18x longer |
| Cost | $9 | $125-200 | 14-22x more |
| Quality | Barely functional | Production-ready | Phase change |
| Core features | Broken | All working | N/A |
| Design | Generic AI slop | Distinctive, polished | N/A |
The tradeoff is clear: roughly 12-22x more time and cost for a qualitative leap in output quality. Use the harness for projects where quality justifies that spend.
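The multipliers follow directly from the time and cost figures in the table; a quick check in Python (variable names are illustrative only):

```python
# Figures from the results table: solo run vs. harness run.
solo_minutes, solo_cost = 20, 9
harness_hours = (4, 6)     # low and high end of the harness run time
harness_cost = (125, 200)  # low and high end of the harness cost in USD

time_ratio = tuple(round(h * 60 / solo_minutes, 1) for h in harness_hours)
cost_ratio = tuple(round(c / solo_cost, 1) for c in harness_cost)

print(time_ratio)  # (12.0, 18.0)
print(cost_ratio)  # (13.9, 22.2)
```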
References
- Anthropic: Harness Design for Long-Running Apps (original paper by Prithvi Rajasekaran)
- Epsilla: The GAN-Style Agent Loop (architecture deconstruction)
- Martin Fowler: Harness Engineering (broader industry context)
- OpenAI: Harness Engineering (OpenAI's parallel work)