6A — Ruff Harness (Production Reference)
Overview
The first production-validated harness for autoresearch. Uses ruff (Python linter) as the deterministic metric. Zero infrastructure — runs in 2 seconds from any repo with Python files.
Why Ruff (Not SonarQube)
SonarQube was the original metric choice. It failed in production due to:
| Issue | Impact |
|---|---|
| Scanner Docker image (Alpine) has no Node.js | JS-in-HTML analysis crashes, throws away Python results |
| Scanner v8.0.1 uses v2 API | Incompatible with SonarQube CE |
| Scanner v5.0.1 can’t run without Java | WSL2/containers don’t have Java |
| 5+ minute scan cycle (rsync → Docker → poll) | Too slow for experiment loops |
| Network routing issues on Unraid | Containers can’t reach SonarQube |
Ruff solves all of these: pip install ruff, runs in 2 seconds, 800+ Python rules, zero dependencies.
Configuration
# .autoresearch/config.yaml
markers:
  - name: sonar-quality
    description: "Reduce ruff lint errors across Python codebase"
    target:
      mutable:
        - "automation/**/*.py"
        - "stacks/**/*.py"
        - "packages/**/*.py"
        - "tests/**/*.py"
        - "agents/**/*.py"
        - "cli/**/*.py"
      immutable:
        - .autoresearch/config.yaml
    metric:
      command: "ruff check . 2>&1"
      extract: "grep -oP 'Found \\K\\d+'"
      direction: lower
      baseline: 163
      target: 0
      issues_command: "ruff check . --output-format concise 2>&1 | head -30"
    guard:
      command: "ruff check . 2>&1 | grep -qP 'Found \\d+'"
      rework_attempts: 1
    agent:
      budget_per_experiment: 20m
      max_experiments: 1
Key Fields
metric.command + metric.extract
$ ruff check . 2>&1
# ... individual errors ...
# Found 163 errors.
# [*] 120 fixable with the `--fix` option.
$ ruff check . 2>&1 | grep -oP 'Found \K\d+'
163
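The same extraction can be sketched in Python, which makes the zero-error case explicit. This is a minimal sketch, not the harness's own code: `extract_error_count` is a hypothetical helper, and it assumes ruff's summary looks like the output above, with a clean run printing `All checks passed!` instead of a `Found` line.

```python
import re
from typing import Optional

# Matches ruff's summary line, e.g. "Found 163 errors." or "Found 1 error."
FOUND_RE = re.compile(r"Found (\d+) errors?\b")


def extract_error_count(ruff_output: str) -> Optional[int]:
    """Return the error count from `ruff check` output, or None if unrecognized."""
    match = FOUND_RE.search(ruff_output)
    if match:
        return int(match.group(1))
    # A clean run prints no "Found" line; treat its success message as zero.
    if "All checks passed!" in ruff_output:
        return 0
    # Anything else (crash, missing binary) is reported as "no metric".
    return None


sample = "Found 163 errors.\n[*] 120 fixable with the `--fix` option.\n"
print(extract_error_count(sample))  # prints 163
```

Returning `None` rather than `0` for unrecognized output keeps a crashed run from being mistaken for a perfect score.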
metric.issues_command
The output of this command is injected into the agent’s prompt as a list of exact file:line:column locations, each with its rule code:
$ ruff check . --output-format concise 2>&1 | head -30
agents/anton/recommendation/agent.py:19:47: F401 [*] `prompt.DEFAULT_SYSTEM_PROMPT` imported but unused
packages/litellm_manager/litellm_manager/cli.py:7:1: E402 Module level import not at top of file
packages/litellm_manager/litellm_manager/commands/keys.py:63:16: F541 [*] f-string without any placeholders
The agent gets EXACT targets — no exploration needed. This is the single biggest improvement to experiment success rate.
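To illustrate how structured these targets are, the concise lines can be split into fields with a small parser. This is a sketch under assumptions: `Issue` and `parse_issue` are hypothetical names, and the regular expression is inferred only from the three sample lines above, so other ruff output shapes may need adjustments.

```python
import re
from collections import Counter
from typing import NamedTuple, Optional


class Issue(NamedTuple):
    path: str
    line: int
    col: int
    rule: str
    message: str


# Shape: path:line:col: RULE message (an optional "[*]" marks a fixable issue).
ISSUE_RE = re.compile(
    r"^(?P<path>.+?):(?P<line>\d+):(?P<col>\d+): "
    r"(?P<rule>[A-Z]+\d+) (?:\[\*\] )?(?P<message>.*)$"
)


def parse_issue(text: str) -> Optional[Issue]:
    """Parse one concise-format line; return None for non-issue lines."""
    m = ISSUE_RE.match(text.strip())
    if m is None:
        return None  # e.g. the "Found 163 errors." summary line
    return Issue(m["path"], int(m["line"]), int(m["col"]), m["rule"], m["message"])


sample = """\
agents/anton/recommendation/agent.py:19:47: F401 [*] `prompt.DEFAULT_SYSTEM_PROMPT` imported but unused
packages/litellm_manager/litellm_manager/cli.py:7:1: E402 Module level import not at top of file
packages/litellm_manager/litellm_manager/commands/keys.py:63:16: F541 [*] f-string without any placeholders
"""
issues = [i for i in map(parse_issue, sample.splitlines()) if i is not None]
by_rule = Counter(issue.rule for issue in issues)
print(by_rule)
```

Grouping by rule code is one way to decide which class of issue to hand the agent first.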
guard.command
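The guard only confirms that ruff runs without crashing and prints its "Found N errors" summary; the metric itself is the real validation. One caveat: a fully clean run prints "All checks passed!" instead of a "Found" line, so this guard pattern (and the extract pattern) stops matching once the target of 0 is reached.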
Production Results (antoncore, 2026-04-05)
| Run | Baseline | Result | Delta | Status |
|---|---|---|---|---|
| 1 | 186 | 163 | -23 | KEEP |
| 2 | 163 | 133 | -30 | KEEP |
Total: 186 → 133 in 2 experiments. -53 errors. PRs: #218 (merged), #219 (pending approval).
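The Status column follows from comparing each result with its baseline under `direction: lower`. A minimal sketch of that comparison, assuming a hypothetical `decide` helper; the `DISCARD` label and the `higher` direction are illustrative assumptions, not confirmed engine behaviour:

```python
def decide(baseline: int, result: int, direction: str = "lower") -> str:
    """Return "KEEP" when the metric moved in the wanted direction, else "DISCARD"."""
    if direction == "lower":
        improved = result < baseline
    elif direction == "higher":  # assumed counterpart of "lower"
        improved = result > baseline
    else:
        raise ValueError(f"unknown direction: {direction!r}")
    return "KEEP" if improved else "DISCARD"


# The two production runs from the table: (baseline, result).
runs = [(186, 163), (163, 133)]
for baseline, result in runs:
    print(baseline, result, result - baseline, decide(baseline, result))
# prints:
# 186 163 -23 KEEP
# 163 133 -30 KEEP
```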
Lessons Learned
- Issues command is critical: without it, the agent spends 15 minutes exploring. With it, fixes are surgical and fast.
- Mutable paths must cover all directories with issues: the agent silently skips files outside the mutable list.
- Baseline must match current state: a stale baseline in state.json causes “no improvement” false discards.
- Guard should be trivial: complex test suites with missing deps fail the guard. Ruff-as-guard is simple and reliable.
- Always-commit in engine: the agent may time out before committing, so the engine runs git add -A && git commit regardless of agent success or failure.
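The mutable-path lesson can be checked before a run by comparing the paths ruff reports against the mutable globs. The sketch below is an assumption-laden illustration: `glob_to_regex` and `uncovered_paths` are hypothetical helpers, and they assume `**/` means "zero or more directories", which may not match how autoresearch itself interprets the patterns.

```python
import re


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a glob where `**/` spans zero or more directories."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            parts.append("(?:.*/)?")  # any number of directory levels
            i += 3
        elif pattern.startswith("**", i):
            parts.append(".*")  # trailing "**" matches everything below
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")  # single "*" stays within one path segment
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def uncovered_paths(issue_paths, mutable_globs):
    """Return the sorted issue paths that no mutable glob matches."""
    regexes = [glob_to_regex(g) for g in mutable_globs]
    return sorted({p for p in issue_paths if not any(r.fullmatch(p) for r in regexes)})


mutable = ["agents/**/*.py", "packages/**/*.py"]
paths = [
    "agents/anton/recommendation/agent.py",
    "packages/litellm_manager/litellm_manager/cli.py",
    "scripts/deploy.py",  # hypothetical file outside the mutable list
]
print(uncovered_paths(paths, mutable))  # prints ['scripts/deploy.py']
```

Any path this returns would be one the agent cannot fix, so the mutable list should be extended before starting an experiment.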
Applying to Other Repos
# Any Python repo:
mkdir -p .autoresearch
cat > .autoresearch/config.yaml << 'EOF'
markers:
  - name: lint-quality
    target:
      mutable: ["src/**/*.py", "tests/**/*.py"]
      immutable: [.autoresearch/config.yaml]
    metric:
      command: "ruff check . 2>&1"
      extract: "grep -oP 'Found \\K\\d+'"
      direction: lower
      baseline: 0 # will be set by first scan
      issues_command: "ruff check . --output-format concise 2>&1 | head -30"
    guard:
      command: "ruff check . 2>&1 | grep -qP 'Found \\d+'"
    agent:
      model: sonnet
      budget_per_experiment: 20m
      max_experiments: 5
EOF
# Get baseline
ruff check . 2>&1 | grep -oP 'Found \K\d+'
# Update baseline in config.yaml
# Register and run
autoresearch add --path .
autoresearch run --marker <repo>:lint-quality
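The "Update baseline in config.yaml" step above is manual. One possible way to script it is a plain text substitution, sketched here with a hypothetical `set_baseline` helper. It assumes a single marker, so only the first `baseline:` line is rewritten, and it edits the text directly rather than parsing YAML so that comments survive.

```python
import re


def set_baseline(config_text: str, value: int) -> str:
    """Replace the number on the first `baseline:` line, keeping any trailing comment."""
    pattern = re.compile(r"^([ \t]*baseline:[ \t]*)\d+", re.MULTILINE)
    new_text, replaced = pattern.subn(
        lambda m: f"{m.group(1)}{value}", config_text, count=1
    )
    if replaced == 0:
        raise ValueError("no `baseline:` line found")
    return new_text


config = "metric:\n  direction: lower\n  baseline: 0 # will be set by first scan\n"
print(set_baseline(config, 163))
```

In practice the value passed in would be the number printed by the baseline command above.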