minimal-run-and-audit

by @lllllllamav
4.8(656)

RigorPilot Skills are research-first agent skills designed for deep learning experiments. They empower AI agents to reproduce, improve, explore, and audit deep learning research projects with scientific rigor, ensuring meaningful changes, fair comparisons, reproducible evidence, and auditable modifications.

Tags: deep-learning, ai-research, reproducibility, agent-skills, experimentation
Installation
npx skills add lllllllama/ai-paper-reproduction-skill --skill minimal-run-and-audit

Before / After Comparison

Before

Without standardized tools, AI agents and researchers often lose significant time troubleshooting reproduction and audit problems caused by environment differences, code changes, or poor data management. Research progress slows and results become unreliable.

After

RigorPilot ensures scientific rigor, reproducibility, and auditability of experiments. AI agents can quickly identify and verify experiment settings and results, significantly reducing troubleshooting time, accelerating research iteration, and improving result credibility.

SKILL.md

RigorPilot Skills

Research-first Agent Skills for Deep Learning Experiments.

RigorPilot helps AI agents reproduce, improve, and explore deep learning research projects with scientific rigor: meaningful changes, fair comparison, reproducible evidence, and auditable modifications.

Not just higher scores. Meaningful deep learning research progress.

Brand note: the project brand is RigorPilot Skills; the recommended GitHub repository slug is rigorpilot-skills. Legacy install paths remain documented only as compatibility fallbacks while older clients and bookmarks migrate.

Migration note:

  • Project brand: ai-research-workflow-skills -> RigorPilot Skills
  • Existing compatible skill slugs remain available.
  • Preferred install source: lllllllama/rigorpilot-skills
  • Legacy fallback source: lllllllama/ai-paper-reproduction-skills
  • ai-paper-reproduction -> ai-research-reproduction
  • research-explore -> ai-research-explore

What RigorPilot Is

  • Research-first Agent Skills for deep learning experiments.
  • It helps AI agents reproduce, improve, explore, and audit deep learning research work.
  • It is designed for personal research use first.
  • It values scientific meaning, fair comparison, reproducibility, explainability, and collaborator control.
  • It encourages meaningful novelty during exploration, but does not overclaim novelty.

What RigorPilot Is Not

  • Not a generic coding agent.
  • Not a score-chasing automation framework.
  • Not a guarantee of novel discoveries.
  • Not a replacement for researcher judgment.
  • Not a rigid workflow that should weaken strong models.

Core Principles

  1. Do not chase scores blindly.
  2. Do not claim novelty lightly.
  3. Do not break comparability silently.
  4. Do not disguise engineering fixes as research contributions.
  5. Do not leave collaborators out of control.

See references/research-rigor-principles.md.

Rigor and Novelty

Rigor is the baseline. Novel is the aspiration.

Novelty and significance remain hypotheses until supported by literature contrast, ablation evidence, and fair comparison.

RigorPilot should add research judgment and audit awareness without making strong models slower, more mechanical, or less capable.

Deep Learning Focus

RigorPilot is built for deep learning research repositories where README commands, environment setup, data, weights, checkpoints, training, evaluation, metrics, logs, baselines, SOTA tables, and ablations all matter.

This repository is still built around one compatibility rule: trusted by default.

  • Ambiguous requests route to the trusted lane.
  • Exploration requires explicit authorization.
  • Trusted outputs are auditable and durable.
  • Explore outputs are candidate-only and disposable.
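
The routing rule above can be sketched in a few lines. This is only an illustration, not the repository's router: `Lane` and `choose_lane` are hypothetical names, and the sketch assumes the authorization has already been reduced to a flag, whereas the real skills decide from the request itself.

```python
from enum import Enum
from typing import Optional


class Lane(Enum):
    TRUSTED = "trusted"
    EXPLORE = "explore"


def choose_lane(explore_authorized: Optional[bool] = None) -> Lane:
    """Pick a lane for a request (hypothetical helper, for illustration only)."""
    # Only an explicit True opens the explore lane. None models an ambiguous
    # request, which falls back to the trusted lane just like an explicit False.
    if explore_authorized is True:
        return Lane.EXPLORE
    return Lane.TRUSTED


if __name__ == "__main__":
    for flag in (None, False, True):
        print(flag, "->", choose_lane(flag).value)
```

The default argument is `None` on purpose, so a caller that forgets to pass anything still lands in the trusted lane.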

Shared operating principles live in references/agent-operating-principles.md. They keep the skills focused on high-level guidance: think before acting, keep the solution small, change only what is necessary, and work toward verifiable goals. They are guardrails, not a detailed script for every implementation choice.

🧭 Current Repo Snapshot

This repository currently ships:

  • 11 skills total: 9 public skills and 2 helper skills.
  • 6 trusted-lane public skills and 3 explore-lane public skills.
  • 4 project-scoped Claude Code command wrappers under .claude/commands/.
  • 45 Python scripts, including 43 test scripts with focused research-explore regressions and document-structure checks.
  • A RigorPilot Explore chain that now includes bounded idea-seed generation, explicit idea score breakdowns, atomic idea decomposition, and implementation-fidelity evidence split into planned, heuristic, and observed layers.
  • A documented and tested workflow intended to be usable from both Windows PowerShell and Linux shells.

The skills use the open SKILL.md layout, so the same repository can be installed into neutral Agent Skills directories as well as Codex and Claude Code. For shared local installs, prefer ~/.agents/skills/ or ./.agents/skills/. Client-specific installs under ~/.codex/skills/ and ~/.claude/skills/ remain supported.

💻 Windows and Linux Notes

This repository is intended to be usable on both Windows and Linux.

  • The command examples below are written in a shell-neutral style around python ..., npx ..., and relative paths.
  • For user-scoped install targets, prefer $HOME/.agents/skills, $HOME/.codex/skills, and $HOME/.claude/skills. These work well in Linux shells and in PowerShell, and Python accepts forward slashes on Windows paths.
  • Project-scoped paths such as ./.agents/skills and ./tmp/codex-skills are also valid on both platforms.
  • The repository validation and routing checks are already exercised in both Windows-oriented and Linux-oriented environments through local tests and CI.
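
As a rough illustration of the path conventions above, the following sketch builds user-scoped and project-scoped targets with `pathlib`. The helper `skills_target` and the `CLIENT_DIRS` mapping are hypothetical and are not part of `scripts/install_skills.py`. A project-scoped `./.codex/skills` target is an assumption, since only `./.agents/skills` and `./.claude/skills` are documented here.

```python
from pathlib import Path

# Directory names come from the notes above; the helper is hypothetical.
CLIENT_DIRS = {"agents": ".agents", "codex": ".codex", "claude": ".claude"}


def skills_target(client: str, project_scoped: bool = False) -> Path:
    """Return a user-scoped or project-scoped skills directory for a client."""
    if client not in CLIENT_DIRS:
        raise ValueError(f"unknown client: {client!r}")
    # Path.home() plays the role of $HOME; Path(".") is the project root.
    base = Path(".") if project_scoped else Path.home()
    return base / CLIENT_DIRS[client] / "skills"


if __name__ == "__main__":
    # as_posix() prints forward slashes on Windows as well as Linux.
    print(skills_target("agents", project_scoped=True).as_posix())
    print(skills_target("codex").as_posix())
```

Building paths from parts rather than string concatenation keeps the same code usable from PowerShell and Linux shells.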

🛠️ Install

For most users, start with npx. It is the shortest path and should be enough for normal use.

Recommended: npx

Install the full repository skill set:

npx skills add lllllllama/rigorpilot-skills --all

Install only the trusted main entrypoint:

npx skills add lllllllama/rigorpilot-skills --skill ai-research-reproduction

Install only the exploratory main entrypoint:

npx skills add lllllllama/rigorpilot-skills --skill ai-research-explore

If you only want to get started quickly, stop here.

Claude Code can auto-invoke these skills when the descriptions match, or you can call them directly with commands such as /ai-research-reproduction, /ai-research-explore, and /safe-debug.

Project-scoped Claude Code slash commands currently ship for:

  • /ai-research-reproduction
  • /ai-research-explore
  • /analyze-project
  • /safe-debug

Advanced: local clone installs

Use the Python installer only if you are developing locally, need a project-scoped install, or want to target neutral Agent Skills, Codex, or Claude Code directories manually.

Install from a local clone into a neutral Agent Skills directory:

python scripts/install_skills.py --client agents --target "$HOME/.agents/skills" --force

Install into a project-scoped neutral Agent Skills directory:

python scripts/install_skills.py --client agents --target ./.agents/skills --force

Install with the default neutral target:

python scripts/install_skills.py --force

Install the full repository skill set in Codex:

npx skills add lllllllama/rigorpilot-skills --all

Install only the trusted reproduction orchestrator in Codex:

npx skills add lllllllama/rigorpilot-skills --skill ai-research-reproduction

Legacy GitHub source fallback, if the new slug is not yet available in your environment:

npx skills add lllllllama/ai-paper-reproduction-skills --all

Install from a local clone into Codex:

python scripts/install_skills.py --client codex --target "$HOME/.codex/skills" --force

Install from a local clone into Claude Code:

python scripts/install_skills.py --client claude --target "$HOME/.claude/skills" --force

Install into a project-scoped Claude Code skills directory:

python scripts/install_skills.py --client claude --target ./.claude/skills --force

PowerShell note:

  • In Windows PowerShell, the same commands work as written above.
  • If you prefer explicit Windows-style paths, replace $HOME/.codex/skills with something like $env:USERPROFILE\.codex\skills.

🎯 Choose an Entry Point

RigorPilot modes map to the current compatible skill slugs:

| If you want to... | RigorPilot mode | Current skill slug |
| --- | --- | --- |
| Reproduce a deep learning repository from README commands | Reproduce | ai-research-reproduction |
| Explore meaningful and potentially novel ideas on top of current research | Explore | ai-research-explore |
| Improve a baseline while preserving comparability | Improve | ai-research-explore, explore-code, explore-run |
| Audit changes, scientific meaning, and comparability | Audit | analyze-project, safe-debug, generated reports |
| Analyze repository structure without editing | Analyze | analyze-project |
| Prepare environment, datasets, weights, and cache assumptions | Setup | env-and-assets-bootstrap |
| Run documented evaluation or inference conservatively | Run | minimal-run-and-audit |
| Start or verify training conservatively | Train | run-train |
| Debug a failure safely | Debug | safe-debug |

Bundled helper skills:

  • repo-intake-and-plan
  • paper-context-resolver

🛣️ Lane Model

🔒 Trusted Lane

Use the trusted lane for reproduction, setup, analysis, bounded execution, training verification, and debugging.

  • Primary end-to-end orchestrator: ai-research-reproduction
  • Output directories: repro_outputs/, train_outputs/, analysis_outputs/, debug_outputs/
  • Default stance: preserve scientific meaning, minimize semantic changes, surface assumptions and blockers

🧪 Explore Lane

Use the explore lane only when the researcher explicitly authorizes candidate-only exploratory work.

  • Primary end-to-end orchestrator: ai-research-explore
  • Narrow leaf skills: explore-code, explore-run
  • Output directory: explore_outputs/
  • Key anchor: current_research

current_research should be a durable reference such as a branch, commit, checkpoint, run record, or already-trained local model state. It does not imply a trusted baseline; it is the context the exploration branches from.
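
One way to picture such an anchor is as a small immutable record. The sketch below is an assumption about shape only: `CurrentResearch`, `ANCHOR_KINDS`, and the field names are hypothetical, and the repository may represent the anchor differently.

```python
from dataclasses import dataclass

# Anchor kinds listed in the text; the identifiers themselves are hypothetical.
ANCHOR_KINDS = frozenset(
    {"branch", "commit", "checkpoint", "run_record", "local_model_state"}
)


@dataclass(frozen=True)
class CurrentResearch:
    kind: str  # one of ANCHOR_KINDS
    ref: str   # e.g. a branch name, commit hash, or checkpoint path

    def __post_init__(self) -> None:
        if self.kind not in ANCHOR_KINDS:
            raise ValueError(f"unsupported anchor kind: {self.kind!r}")
        if not self.ref.strip():
            raise ValueError("a durable anchor needs a non-empty reference")


if __name__ == "__main__":
    anchor = CurrentResearch(kind="commit", ref="abc1234")  # made-up hash
    print(anchor)
```

The record deliberately carries no "trusted" flag, which mirrors the point that the anchor is context for exploration rather than a verified baseline.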

🧰 Helper Lane

Helpers are intentionally narrow and should usually be orchestrator-invoked rather than used as the first entry point.

🔗 Client Compatibility

SKILL.md is the canonical cross-client contract in this repository.

  • Required for portability: SKILL.md, repository-local scripts/, and references/
  • Optional Codex UI metadata: agents/openai.yaml
  • Optional Claude Code project entrypoints: .claude/commands/*.md
  • Not allowed: making skill behavior depend on a client-specific metadata file

See references/client-compatibility-policy.md.
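
A quick way to check the portability requirement is to look for the three required entries. This is a minimal sketch, assuming all three sit under a single root directory; `missing_portable_entries` is a hypothetical helper, not one of the repository's validation scripts.

```python
import tempfile
from pathlib import Path
from typing import List

# Entries the text lists as required for portability.
REQUIRED_ENTRIES = ("SKILL.md", "scripts", "references")


def missing_portable_entries(root: Path) -> List[str]:
    """List required entries absent under root (hypothetical checker)."""
    return [name for name in REQUIRED_ENTRIES if not (root / name).exists()]


if __name__ == "__main__":
    # Demonstrate on a throwaway directory that only contains SKILL.md.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "SKILL.md").write_text("# demo\n", encoding="utf-8")
        print(missing_portable_entries(root))
```

Client-specific files such as agents/openai.yaml are intentionally left out of the check, because skill behavior must not depend on them.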

🔁 Lifecycle View

The repository follows a lifecycle-oriented routing model:

flowchart LR
    A[Understand] --> B[Reproduce]
    B --> C[Set up]
    C --> D[Run or train]
    D --> E[Debug]
    E --> F[Report]
    B -. explicit only .-> G[Explore]
    G --> H[Rank candidates]
    H --> F

This lifecycle is intentionally shallow. It helps the agent choose the right lane and evidence target without forcing a fixed implementation sequence inside each repository.

🗺️ Routing Overview

flowchart TD
    A[User request] --> B{Explicit candidate-only exploration?}
    B -- No --> C[Trusted lane]
    B -- Yes --> D[Explore lane]

    C --> C1[ai-research-reproduction]
    C --> C2[analyze-project]
    C --> C3[env-and-assets-bootstrap]
    C --> C4[minimal-run-and-audit]
    C --> C5[run-train]
    C --> C6[safe-debug]

    D --> D1[ai-research-explore]
    D --> D2[explore-code]
    D --> D3[explore-run]

    C1 -. helper .-> H1[repo-intake-and-plan]
    C1 -. helper .-> H2[paper-context-resolver]

🧠 RigorPilot Explore Flow

ai-research-explore is the RigorPilot Explore entrypoint when the researcher has already frozen the task family, dataset, evaluation method, and provided SOTA references, then explicitly authorizes candidate-only exploration on top of current_research. In RigorPilot terms, this is meaningful and potentially novel candidate work, not verified novelty.

flowchart LR
    A[current_research + frozen campaign] --> B[Outer loop:<br/>understand, source, gate]
    B --> C{candidate worth trying?}
    C -- No --> D[Stop with blocker or checkpoint]
    C -- Yes --> E[Inner loop:<br/>bounded change or run]
    E --> F[Smoke and evidence]
    F --> G[Rank candidate]
    G --> B
    G --> H[explore_outputs<br/>candidate-only summary]

Current RigorPilot implementation highlights:

  • Researcher ideas are preserved, then optionally expanded with bounded synthesized or hybrid seed ideas in analysis_outputs/IDEA_SEEDS.json.
  • Idea ranking uses hard gates plus explicit weighted breakdowns in analysis_outputs/IDEA_SCORES.json.
  • Selected ideas are decomposed into atomic academic concepts in analysis_outputs/ATOMIC_IDEA_MAP.md and analysis_outputs/ATOMIC_IDEA_MAP.json.
  • Implementation fidelity distinguishes planned, heuristic, and observed implementation evidence in analysis_outputs/IMPLEMENTATION_FIDELITY.md and analysis_outputs/IMPLEMENTATION_FIDELITY.json.
  • Executor-observed evidence now comes from emitted changed_files, new_files, deleted_files, and touched_paths rather than planned target placeholders.
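
The "hard gates plus explicit weighted breakdowns" idea can be illustrated as follows. Everything specific here is an assumption: the criterion names, the weights, the gate names, and the returned dictionary shape are invented for the example and do not describe the actual schema of analysis_outputs/IDEA_SCORES.json.

```python
from typing import Dict, List

# Hypothetical criteria and weights; the real IDEA_SCORES.json schema may differ.
WEIGHTS: Dict[str, float] = {
    "scientific_meaning": 0.5,
    "comparability": 0.3,
    "feasibility": 0.2,
}


def score_idea(gates: Dict[str, bool], ratings: Dict[str, float]) -> dict:
    """Apply hard gates first, then report a per-criterion weighted breakdown."""
    failed: List[str] = sorted(name for name, ok in gates.items() if not ok)
    if failed:
        # A failed hard gate rejects the idea no matter how high its ratings are.
        return {"accepted": False, "failed_gates": failed}
    breakdown = {name: weight * ratings[name] for name, weight in WEIGHTS.items()}
    return {"accepted": True, "breakdown": breakdown, "total": sum(breakdown.values())}


if __name__ == "__main__":
    gates = {"keeps_frozen_eval": True, "fits_budget": True}  # made-up gate names
    ratings = {"scientific_meaning": 0.8, "comparability": 1.0, "feasibility": 0.5}
    print(score_idea(gates, ratings))
```

Keeping the per-criterion breakdown next to the total is what makes a ranking auditable: a reviewer can see why one idea outranked another instead of trusting a single number.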

The two-loop rhythm is a guide, not a never-stop autonomous agent. RigorPilot Explore stops at explicit blockers, unclear scientific meaning, exhausted budget, missing anchors, or human checkpoints. The explore lane must not claim trusted reproduction success, global benchmark completeness, or verified novelty.
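
The stop conditions listed above can be read as a guard evaluated before each loop iteration. The sketch below is an illustration under assumptions: `LoopState`, its fields, and the order in which conditions are checked are hypothetical and are not taken from the repository.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class StopReason(Enum):
    EXPLICIT_BLOCKER = "explicit blocker"
    UNCLEAR_MEANING = "unclear scientific meaning"
    EXHAUSTED_BUDGET = "exhausted budget"
    MISSING_ANCHOR = "missing anchor"
    HUMAN_CHECKPOINT = "human checkpoint"


@dataclass
class LoopState:  # hypothetical snapshot of one explore iteration
    blocker: Optional[str] = None
    meaning_is_clear: bool = True
    budget_left: int = 1
    has_anchor: bool = True
    checkpoint_due: bool = False


def stop_reason(state: LoopState) -> Optional[StopReason]:
    """Return the first stop condition that applies, or None to continue."""
    if state.blocker is not None:
        return StopReason.EXPLICIT_BLOCKER
    if not state.has_anchor:
        return StopReason.MISSING_ANCHOR
    if not state.meaning_is_clear:
        return StopReason.UNCLEAR_MEANING
    if state.budget_left <= 0:
        return StopReason.EXHAUSTED_BUDGET
    if state.checkpoint_due:
        return StopReason.HUMAN_CHECKPOINT
    return None


if __name__ == "__main__":
    print(stop_reason(LoopState()))
    print(stop_reason(LoopState(budget_left=0)))
```

Returning a named reason rather than a bare boolean lets the candidate-only summary record why exploration stopped.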

📦 Public Skill Matrix

| Lane | Skill | Purpose |
| --- | --- | --- |
| Trusted | ai-research-reproduction | End-to-end README-first reproduction orchestrator |
| Trusted | env-and-assets-bootstrap | Conservative environment, dataset, checkpoint, and cache planning |
| Trusted | minimal-run-and-audit | Trusted inference, evaluation, smoke, and sanity execution |
| Trusted | analyze-project | Read-only project analysis, model mapping, and risk surfacing |
| Trusted | run-train | Training startup verification, resume handling, bounded monitoring, and training records |
| Trusted | safe-debug | Research-safe debugging: analyze first, patch only after approval |
Explore`ai-research

...

Statistics

Installs: 127.4K
Rating: 4.8 / 5.0
Updated: May 23, 2026
Comparisons: 1

User Rating

4.8 (656)

5 stars: 41%
4 stars: 47%
3 stars: 12%
2 stars: 1%
1 star: 0%

Compatible Platforms

🔧Claude Code

Timeline

Created: March 31, 2026
Last Updated: May 23, 2026