Improve the efficiency of reproducing and auditing deep learning experiments

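Before: Without standardized tooling, AI agents or researchers who reproduce and audit deep learning experiments often spend a great deal of time troubleshooting problems caused by environment differences, code changes, or poor data management. Research progress slows down and results become unreliable.

After: RigorPilot ensures the scientific rigor, reproducibility, and auditability of experiments. AI agents can quickly identify and verify experiment settings and results, which greatly reduces troubleshooting time, speeds up research iteration, and makes results more credible.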

minimal-run-and-audit · deep learning

RigorPilot Skills

Research-first Agent Skills for Deep Learning Experiments.

RigorPilot helps AI agents reproduce, improve, and explore deep learning research projects with scientific rigor: meaningful changes, fair comparison, reproducible evidence, and auditable modifications.

Not just higher scores. Meaningful deep learning research progress.

Brand note: the project brand is RigorPilot Skills; the recommended GitHub repository slug is rigorpilot-skills. Legacy install paths remain documented only as compatibility fallbacks while older clients and bookmarks migrate.

Migration note:

Project brand: ai-research-workflow-skills -> RigorPilot Skills
Existing compatible skill slugs remain available.
Preferred install source: lllllllama/rigorpilot-skills
Legacy fallback source: lllllllama/ai-paper-reproduction-skills
Skill slug rename: ai-paper-reproduction -> ai-research-reproduction
Skill slug rename: research-explore -> ai-research-explore

What RigorPilot Is

Research-first Agent Skills for deep learning experiments.
It helps AI agents reproduce, improve, explore, and audit deep learning research work.
It is designed for personal research use first.
It values scientific meaning, fair comparison, reproducibility, explainability, and collaborator control.
It encourages meaningful novelty during exploration, but does not overclaim novelty.

What RigorPilot Is Not

Not a generic coding agent.
Not a score-chasing automation framework.
Not a guarantee of novel discoveries.
Not a replacement for researcher judgment.
Not a rigid workflow that should weaken strong models.

Core Principles

Do not chase scores blindly.
Do not claim novelty lightly.
Do not break comparability silently.
Do not disguise engineering fixes as research contributions.
Do not leave collaborators out of control.

See references/research-rigor-principles.md.

Rigor and Novelty

Rigor is the baseline. Novel is the aspiration.

Novelty and significance remain hypotheses until supported by literature contrast, ablation evidence, and fair comparison.

RigorPilot should add research judgment and audit awareness without making strong models slower, more mechanical, or less capable.

Deep Learning Focus

RigorPilot is built for deep learning research repositories where README commands, environment setup, data, weights, checkpoints, training, evaluation, metrics, logs, baselines, SOTA tables, and ablations all matter.

This repository is still built around one compatibility rule: trusted by default.

Ambiguous requests route to the trusted lane.
Exploration requires explicit authorization.
Trusted outputs are auditable and durable.
Explore outputs are candidate-only and disposable.
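The routing rule above can be sketched in a few lines of Python. This is only an illustration, not the repository's routing code: the `Request` type, its `explore_authorized` flag, and the `route` function are hypothetical names introduced here.

```python
from dataclasses import dataclass
from enum import Enum


class Lane(Enum):
    TRUSTED = "trusted"
    EXPLORE = "explore"


@dataclass(frozen=True)
class Request:
    """Hypothetical request record used only for this sketch."""

    text: str
    # True only when the researcher has explicitly authorized
    # candidate-only exploration.
    explore_authorized: bool = False


def route(request: Request) -> Lane:
    """Pick a lane: explore needs explicit authorization, everything else is trusted."""
    if request.explore_authorized:
        return Lane.EXPLORE
    # Ambiguous or unmarked requests fall back to the trusted lane.
    return Lane.TRUSTED


if __name__ == "__main__":
    print(route(Request("make the model better")).value)
    print(route(Request("try candidate ideas", explore_authorized=True)).value)
```

The sketch deliberately ignores the request text: under the trusted-by-default rule, wording alone is not treated as authorization.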

Shared operating principles live in references/agent-operating-principles.md. They keep the skills focused on high-level guidance: think before acting, keep the solution small, change only what is necessary, and work toward verifiable goals. They are guardrails, not a detailed script for every implementation choice.

🧭 Current Repo Snapshot

This repository currently ships:

11 skills total: 9 public skills and 2 helper skills.
6 trusted-lane public skills and 3 explore-lane public skills.
4 project-scoped Claude Code command wrappers under .claude/commands/.
45 Python scripts, including 43 test scripts with focused research-explore regressions and document-structure checks.
A RigorPilot Explore chain that now includes bounded idea-seed generation, explicit idea score breakdowns, atomic idea decomposition, and implementation-fidelity evidence split into planned, heuristic, and observed layers.
A documented and tested workflow intended to be usable from both Windows PowerShell and Linux shells.

The skills use the open SKILL.md layout, so the same repository can be installed into neutral Agent Skills directories as well as Codex and Claude Code. For shared local installs, prefer ~/.agents/skills/ or ./.agents/skills/. Client-specific installs under ~/.codex/skills/ and ~/.claude/skills/ remain supported.

💻 Windows and Linux Notes

This repository is intended to be usable on both Windows and Linux.

The command examples below are written in a shell-neutral style around python ..., npx ..., and relative paths.
For user-scoped install targets, prefer $HOME/.agents/skills, $HOME/.codex/skills, and $HOME/.claude/skills. These work well in Linux shells and in PowerShell, and Python accepts forward slashes on Windows paths.
Project-scoped paths such as ./.agents/skills and ./tmp/codex-skills are also valid on both platforms.
Repository validation and routing checks are already exercised in both Windows and Linux environments through local tests and CI.
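As a small illustration of the path notes above, the following sketch builds the user-scoped install targets with `pathlib`, which resolves the home directory on both Windows and Linux. The helper name `user_skills_dir` is hypothetical and is not part of `scripts/install_skills.py`; the client names simply mirror the `--client` values used in the install commands.

```python
from pathlib import Path

# Client names mirror the --client values used in the install commands.
CLIENTS = ("agents", "codex", "claude")


def user_skills_dir(client: str = "agents") -> Path:
    """Return the user-scoped skills directory for a client (hypothetical helper)."""
    if client not in CLIENTS:
        raise ValueError(f"unknown client: {client!r}")
    # Path.home() works on both Windows and Linux.
    return Path.home() / f".{client}" / "skills"


if __name__ == "__main__":
    for name in CLIENTS:
        # as_posix() prints forward slashes, which Python also accepts on Windows.
        print(name, user_skills_dir(name).as_posix())
```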

🛠️ Install

For most users, start with npx. It is the shortest path and should be enough for normal use.

Recommended: `npx`

Install the full repository skill set:

npx skills add lllllllama/rigorpilot-skills --all

Install only the trusted main entrypoint:

npx skills add lllllllama/rigorpilot-skills --skill ai-research-reproduction

Install only the exploratory main entrypoint:

npx skills add lllllllama/rigorpilot-skills --skill ai-research-explore

If you only want to get started quickly, stop here.

Claude Code can auto-invoke these skills when the descriptions match, or you can call them directly with commands such as /ai-research-reproduction, /ai-research-explore, and /safe-debug.

Project-scoped Claude Code slash commands currently ship for:

/ai-research-reproduction
/ai-research-explore
/analyze-project
/safe-debug

Advanced: local clone installs

Use the Python installer only if you are developing locally, need a project-scoped install, or want to target neutral Agent Skills, Codex, or Claude Code directories manually.

Install from a local clone into a neutral Agent Skills directory:

python scripts/install_skills.py --client agents --target "$HOME/.agents/skills" --force

Install into a project-scoped neutral Agent Skills directory:

python scripts/install_skills.py --client agents --target ./.agents/skills --force

Install with the default neutral target:

python scripts/install_skills.py --force

Install the full repository skill set in Codex:

npx skills add lllllllama/rigorpilot-skills --all

Install only the trusted reproduction orchestrator in Codex:

npx skills add lllllllama/rigorpilot-skills --skill ai-research-reproduction

Legacy GitHub source fallback, if the new slug is not yet available in your environment:

npx skills add lllllllama/ai-paper-reproduction-skills --all

Install from a local clone into Codex:

python scripts/install_skills.py --client codex --target "$HOME/.codex/skills" --force

Install from a local clone into Claude Code:

python scripts/install_skills.py --client claude --target "$HOME/.claude/skills" --force

Install into a project-scoped Claude Code skills directory:

python scripts/install_skills.py --client claude --target ./.claude/skills --force

PowerShell note:

In Windows PowerShell, the same commands work as written above.
If you prefer explicit Windows-style paths, replace $HOME/.codex/skills with something like $env:USERPROFILE\.codex\skills.

🎯 Choose an Entry Point

RigorPilot modes map to the current compatible skill slugs:

| If you want to... | RigorPilot mode | Current skill slug |
| --- | --- | --- |
| Reproduce a deep learning repository from README commands | Reproduce | `ai-research-reproduction` |
| Explore meaningful and potentially novel ideas on top of current research | Explore | `ai-research-explore` |
| Improve a baseline while preserving comparability | Improve | `ai-research-explore`, `explore-code`, `explore-run` |
| Audit changes, scientific meaning, and comparability | Audit | `analyze-project`, `safe-debug`, generated reports |
| Analyze repository structure without editing | Analyze | `analyze-project` |
| Prepare environment, datasets, weights, and cache assumptions | Setup | `env-and-assets-bootstrap` |
| Run documented evaluation or inference conservatively | Run | `minimal-run-and-audit` |
| Start or verify training conservatively | Train | `run-train` |
| Debug a failure safely | Debug | `safe-debug` |

Bundled helper skills:

repo-intake-and-plan
paper-context-resolver

🛣️ Lane Model

🔒 Trusted Lane

Use the trusted lane for reproduction, setup, analysis, bounded execution, training verification, and debugging.

Primary end-to-end orchestrator: ai-research-reproduction
Output directories: repro_outputs/, train_outputs/, analysis_outputs/, debug_outputs/
Default stance: preserve scientific meaning, minimize semantic changes, surface assumptions and blockers

🧪 Explore Lane

Use the explore lane only when the researcher explicitly authorizes candidate-only exploratory work.

Primary end-to-end orchestrator: ai-research-explore
Narrow leaf skills: explore-code, explore-run
Output directory: explore_outputs/
Key anchor: current_research

current_research should be a durable reference such as a branch, commit, checkpoint, run record, or already-trained local model state. It does not imply a trusted baseline; it is the context the exploration branches from.
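One way to picture the `current_research` anchor is as a small typed record. The sketch below is an assumption made for illustration: the class name, field names, and kind labels are hypothetical and do not describe the format the skills actually read.

```python
from dataclasses import dataclass

# Kind labels follow the examples named in the text; the identifiers are hypothetical.
ANCHOR_KINDS = {"branch", "commit", "checkpoint", "run_record", "local_model_state"}


@dataclass(frozen=True)
class CurrentResearch:
    """A durable reference that exploration branches from.

    It is context only; it does not imply a trusted baseline.
    """

    kind: str
    ref: str

    def __post_init__(self) -> None:
        if self.kind not in ANCHOR_KINDS:
            raise ValueError(f"unsupported anchor kind: {self.kind!r}")
        if not self.ref.strip():
            raise ValueError("anchor reference must not be empty")


if __name__ == "__main__":
    # The commit hash is a placeholder value.
    print(CurrentResearch(kind="commit", ref="abc1234"))
```

Rejecting an empty reference reflects the idea that exploration should stop on a missing anchor rather than guess one.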

🧰 Helper Lane

Helpers are intentionally narrow and should usually be orchestrator-invoked rather than used as the first entry point.

🔗 Client Compatibility

SKILL.md is the canonical cross-client contract in this repository.

Required for portability: SKILL.md, repository-local scripts/, and references/
Optional Codex UI metadata: agents/openai.yaml
Optional Claude Code project entrypoints: .claude/commands/*.md
Not allowed: making skill behavior depend on a client-specific metadata file

See references/client-compatibility-policy.md.

🔁 Lifecycle View

The repository follows a lifecycle-oriented routing model:

flowchart LR
    A[Understand] --> B[Reproduce]
    B --> C[Set up]
    C --> D[Run or train]
    D --> E[Debug]
    E --> F[Report]
    B -. explicit only .-> G[Explore]
    G --> H[Rank candidates]
    H --> F

This lifecycle is intentionally shallow. It helps the agent choose the right lane and evidence target without forcing a fixed implementation sequence inside each repository.

🗺️ Routing Overview

flowchart TD
    A[User request] --> B{Explicit candidate-only exploration?}
    B -- No --> C[Trusted lane]
    B -- Yes --> D[Explore lane]

    C --> C1[ai-research-reproduction]
    C --> C2[analyze-project]
    C --> C3[env-and-assets-bootstrap]
    C --> C4[minimal-run-and-audit]
    C --> C5[run-train]
    C --> C6[safe-debug]

    D --> D1[ai-research-explore]
    D --> D2[explore-code]
    D --> D3[explore-run]

    C1 -. helper .-> H1[repo-intake-and-plan]
    C1 -. helper .-> H2[paper-context-resolver]

🧠 RigorPilot Explore Flow

ai-research-explore is the RigorPilot Explore entrypoint for cases where the researcher has already frozen the task family, dataset, and evaluation method, has provided SOTA references, and explicitly authorizes candidate-only exploration on top of current_research. In RigorPilot terms, this is meaningful and potentially novel candidate work, not verified novelty.

flowchart LR
    A[current_research + frozen campaign] --> B[Outer loop:<br/>understand, source, gate]
    B --> C{candidate worth trying?}
    C -- No --> D[Stop with blocker or checkpoint]
    C -- Yes --> E[Inner loop:<br/>bounded change or run]
    E --> F[Smoke and evidence]
    F --> G[Rank candidate]
    G --> B
    G --> H[explore_outputs<br/>candidate-only summary]

Current RigorPilot implementation highlights:

Researcher ideas are preserved, then optionally expanded with bounded synthesized or hybrid seed ideas in analysis_outputs/IDEA_SEEDS.json.
Idea ranking uses hard gates plus explicit weighted breakdowns in analysis_outputs/IDEA_SCORES.json.
Selected ideas are decomposed into atomic academic concepts in analysis_outputs/ATOMIC_IDEA_MAP.md and analysis_outputs/ATOMIC_IDEA_MAP.json.
Implementation fidelity distinguishes planned, heuristic, and observed implementation evidence in analysis_outputs/IMPLEMENTATION_FIDELITY.md and analysis_outputs/IMPLEMENTATION_FIDELITY.json.
Executor-observed evidence now comes from emitted changed_files, new_files, deleted_files, and touched_paths rather than planned target placeholders.
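To make "hard gates plus explicit weighted breakdowns" concrete, here is a minimal sketch of that ranking pattern. It does not reproduce the actual schema of analysis_outputs/IDEA_SCORES.json: the gate names, criteria, weights, and the `Idea` type are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Idea:
    """Hypothetical idea record for this sketch."""

    name: str
    gates: dict[str, bool]    # hard gates: every one must pass
    scores: dict[str, float]  # per-criterion scores in [0, 1]


# Hypothetical criteria and weights, chosen only for illustration.
WEIGHTS = {"scientific_meaning": 0.5, "feasibility": 0.3, "comparability": 0.2}


def score(idea: Idea, weights: dict[str, float]) -> dict:
    """Return a score record with an explicit per-criterion breakdown."""
    breakdown = {k: w * idea.scores.get(k, 0.0) for k, w in weights.items()}
    return {
        "name": idea.name,
        "passed_gates": all(idea.gates.values()),
        "breakdown": breakdown,
        "total": sum(breakdown.values()),
    }


def rank(ideas: list[Idea], weights: dict[str, float]) -> list[dict]:
    """Drop ideas that fail any hard gate, then sort the rest by total score."""
    scored = [score(idea, weights) for idea in ideas]
    eligible = [s for s in scored if s["passed_gates"]]
    return sorted(eligible, key=lambda s: s["total"], reverse=True)


if __name__ == "__main__":
    ideas = [
        Idea("a", {"keeps_eval_frozen": True},
             {"scientific_meaning": 0.8, "feasibility": 0.5, "comparability": 1.0}),
        Idea("b", {"keeps_eval_frozen": False},
             {"scientific_meaning": 1.0, "feasibility": 1.0, "comparability": 1.0}),
    ]
    for entry in rank(ideas, WEIGHTS):
        print(entry["name"], entry["breakdown"])
```

Keeping the gate check separate from the weighted sum means a high score can never compensate for a failed gate, and the breakdown stays visible for audit.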

The two-loop rhythm is a guide, not a mandate for a never-stopping autonomous agent. RigorPilot Explore stops at explicit blockers, unclear scientific meaning, exhausted budget, missing anchors, or human checkpoints. The explore lane must not claim trusted reproduction success, global benchmark completeness, or verified novelty.
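The stop conditions can be read as early exits from a bounded loop. The sketch below assumes a hypothetical `check` callback that reports a stop reason for a candidate, and it assumes that running out of candidates hands control back to the researcher as a human checkpoint. None of these names come from the repository.

```python
from enum import Enum
from typing import Callable, Iterable, Optional


class StopReason(Enum):
    BLOCKER = "explicit blocker"
    UNCLEAR_MEANING = "unclear scientific meaning"
    BUDGET_EXHAUSTED = "exhausted budget"
    MISSING_ANCHOR = "missing anchor"
    HUMAN_CHECKPOINT = "human checkpoint"


def explore(
    candidates: Iterable[str],
    budget: int,
    check: Callable[[str], Optional[StopReason]],
) -> tuple[list[str], StopReason]:
    """Try candidates in order and stop at the first stop condition."""
    tried: list[str] = []
    for candidate in candidates:
        if len(tried) >= budget:
            return tried, StopReason.BUDGET_EXHAUSTED
        reason = check(candidate)
        if reason is not None:
            # Stop before running a candidate that hits a stop condition.
            return tried, reason
        tried.append(candidate)
    # Assumption: an exhausted candidate list returns control to the researcher.
    return tried, StopReason.HUMAN_CHECKPOINT


if __name__ == "__main__":
    tried, reason = explore(["a", "b", "c"], budget=2, check=lambda c: None)
    print(tried, reason.value)
```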

📦 Public Skill Matrix

| Lane | Skill | Purpose |
| --- | --- | --- |
| Trusted | `ai-research-reproduction` | End-to-end README-first reproduction orchestrator |
| Trusted | `env-and-assets-bootstrap` | Conservative environment, dataset, checkpoint, and cache planning |
| Trusted | `minimal-run-and-audit` | Trusted inference, evaluation, smoke, and sanity execution |
| Trusted | `analyze-project` | Read-only project analysis, model mapping, and risk surfacing |
| Trusted | `run-train` | Training startup verification, resume handling, bounded monitoring, and training records |
| Trusted | `safe-debug` | Research-safe debugging: analyze first, patch only after approval |
Explore	`ai-research

...
