# Publish SWE-bench Results via GitHub Pages
## Problem / Motivation
eforge benchmarks now produce real results (2/5 resolved on the latest run), but there is no way to publish or track them over time. There is no public-facing site showing methodology, latest results, or historical runs. Without a publishing mechanism, benchmark progress is invisible and untrackable.
## Goal
Ship a simple GitHub Pages site that displays methodology, latest SWE-bench results, and historical runs — updated via a manual publish script after each benchmark run.
## Approach
### Jekyll site in `docs/` on the `main` branch

GitHub Pages serves directly from `docs/` — no build step, no separate branch required.
```
docs/
  _config.yml        # Jekyll config (theme: minima)
  _data/
    runs.json        # Append-only historical run data
  index.md           # Homepage: latest results + history table
  methodology.md     # Static: how eforge works, benchmark approach
  results/
    index.md         # All runs table (regenerated)
    <timestamp>.md   # Per-run detail page (generated)
```
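A minimal `docs/_config.yml` for this layout might look like the following sketch. Only `theme` and `baseurl` are specified by this plan; the `title` and `description` values are illustrative placeholders:

```yaml
# docs/_config.yml — minimal sketch; theme and baseurl per this plan,
# title/description are placeholders
theme: minima
baseurl: /benchmarks
title: eforge SWE-bench Results
description: Methodology, latest results, and historical runs
```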
### `publish.py` — manual publish script (~200 lines, stdlib only)

Usage: `python publish.py results/2026-03-28T03-05-38/ [--notes "description"]`
Steps:
- Load data from 3 sources:
  - `results/<timestamp>/config.json` — run config (instances, timeout, dataset)
  - `results/<timestamp>/eforge_metadata.jsonl` — per-instance timing, exit codes, failure reasons
  - `eforge.eforge_predictions.json` (repo root) — SWE-bench eval report (resolved/unresolved IDs)
- Detect eforge version via `npm list -g eforge --json`
- Build run summary and append to `docs/_data/runs.json` (duplicate check by timestamp)
- Generate `docs/results/<timestamp>.md` — per-run detail page
- Regenerate `docs/results/index.md` — all runs table
- Regenerate `docs/index.md` — homepage with latest results
- Print summary, remind user to commit + push (no auto-commit)
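The load, version-detection, and append steps above can be sketched roughly as follows. The file layout follows this plan, but the JSON key names (and the shape of `npm list --json` output) are assumptions, not a fixed contract:

```python
import json
import subprocess
import sys
from pathlib import Path


def load_run_inputs(run_dir: Path,
                    report_path: Path = Path("eforge.eforge_predictions.json")) -> dict:
    """Load the three data sources for one run. Paths follow this plan;
    the default report_path assumes the repo root is the working directory."""
    config = json.loads((run_dir / "config.json").read_text())
    # eforge_metadata.jsonl holds one JSON object per line (one per instance)
    metadata = [json.loads(line)
                for line in (run_dir / "eforge_metadata.jsonl").read_text().splitlines()
                if line.strip()]
    report = json.loads(report_path.read_text())
    return {"config": config, "metadata": metadata, "report": report}


def detect_eforge_version() -> str:
    """Best-effort version lookup via npm; falls back to 'unknown'.
    The JSON shape of `npm list --json` output is an assumption here."""
    try:
        out = subprocess.run(["npm", "list", "-g", "eforge", "--json"],
                             capture_output=True, text=True, check=True)
        return json.loads(out.stdout)["dependencies"]["eforge"]["version"]
    except Exception:
        return "unknown"


def append_run(runs_path: Path, summary: dict) -> None:
    """Append one run to docs/_data/runs.json, rejecting duplicate timestamps."""
    runs = json.loads(runs_path.read_text()) if runs_path.exists() else []
    if any(r["timestamp"] == summary["timestamp"] for r in runs):
        sys.exit(f"Run {summary['timestamp']} already published; refusing duplicate.")
    runs.append(summary)
    runs_path.write_text(json.dumps(runs, indent=2) + "\n")
```

The page-generation steps then render markdown from the appended data before printing the commit reminder.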
Data per run in `runs.json`:
- `timestamp`, `date`, `dataset`, `eforge_version`
- `num_instances`, `num_resolved`, `resolution_rate`
- `resolved_ids`, `unresolved_ids`, `empty_patch_ids`
- `per_instance`: `[{instance_id, status, failure_reason, duration_seconds, patch_lines}]`
- `notes` (optional)
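One way to assemble a `runs.json` entry from the loaded sources — a sketch only; the key names read from the eval report (`resolved_ids`, etc.) are assumptions about its shape:

```python
def build_summary(config: dict, metadata: list, report: dict,
                  eforge_version: str, notes=None) -> dict:
    """Build one runs.json entry matching the per-run schema.
    Key names read from `report` are assumptions about the eval output."""
    resolved = report.get("resolved_ids", [])
    num = config.get("num_instances", len(metadata))
    summary = {
        "timestamp": config["timestamp"],
        "date": config["timestamp"][:10],  # ISO timestamp -> YYYY-MM-DD
        "dataset": config.get("dataset", "SWE-bench Lite"),
        "eforge_version": eforge_version,
        "num_instances": num,
        "num_resolved": len(resolved),
        "resolution_rate": round(len(resolved) / num, 3) if num else 0.0,
        "resolved_ids": resolved,
        "unresolved_ids": report.get("unresolved_ids", []),
        "empty_patch_ids": report.get("empty_patch_ids", []),
        "per_instance": [
            {
                "instance_id": m["instance_id"],
                "status": "resolved" if m["instance_id"] in resolved else "unresolved",
                "failure_reason": m.get("failure_reason"),
                "duration_seconds": m.get("duration_seconds"),
                "patch_lines": m.get("patch_lines"),
            }
            for m in metadata
        ],
    }
    if notes:
        summary["notes"] = notes
    return summary
```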
### `docs/methodology.md` — static page
Written once, maintained by hand. Covers:
- What eforge is (multi-agent pipeline: planner, builder, reviewer)
- SWE-bench Lite and what “resolved” means
- Docker harness: how instances are run, PRD generation, patch extraction
- Why runs use different instance subsets
- Known limitations (cost, timeout, contamination)
## Files to create
| File | Type | Description |
|---|---|---|
| `docs/_config.yml` | Create | Jekyll config (minima theme, baseurl: `/benchmarks`) |
| `docs/index.md` | Generate | Homepage with latest results + history |
| `docs/methodology.md` | Create | Static methodology explanation |
| `docs/results/index.md` | Generate | All runs table |
| `docs/_data/runs.json` | Create | Empty `[]`, appended by publish.py |
| `publish.py` | Create | ~200 lines, stdlib only |
## Post-implementation steps

- Run `python publish.py results/2026-03-28T03-05-38/ --notes "First published run"`
- Review generated docs, commit + push
- Enable GitHub Pages in repo settings: Source → `main` branch → `/docs` folder
## Scope

In scope:
- Jekyll site structure in `docs/`, served via GitHub Pages from the `main` branch
- `publish.py` script that reads benchmark output, generates/updates all site pages, and appends to `runs.json`
- Static `methodology.md` page covering eforge architecture, SWE-bench Lite, Docker harness, subset rationale, and known limitations
- Homepage (`index.md`) showing latest results and history table
- Per-run detail pages (`results/<timestamp>.md`) with per-instance tables
- All-runs index page (`results/index.md`)
- Append-only `runs.json` with duplicate check by timestamp
- eforge version detection via `npm list -g eforge --json`
Out of scope:
- Automatic commits or pushes (script prints a reminder only)
- CI/CD or automated publishing pipeline
- Separate branch or external build step for GitHub Pages
- Any dependencies beyond the Python stdlib for `publish.py`
## Acceptance Criteria

- Running `python publish.py results/2026-03-28T03-05-38/` produces:
  - `docs/_data/runs.json` containing exactly 1 entry with correct data (timestamp, resolved/unresolved IDs, resolution rate, per-instance details)
  - `docs/results/2026-03-28T03-05-38.md`, which exists and contains a per-instance table
  - `docs/index.md` displaying the latest results and a history table
  - `docs/results/index.md` displaying an all-runs table
- Running `publish.py` a second time on the same timestamp is rejected (duplicate check).
- The generated markdown pages render correctly when served by Jekyll (verified via `bundle exec jekyll serve` or manual review).
- `docs/methodology.md` exists as a static page covering: the eforge multi-agent pipeline, the SWE-bench Lite definition of “resolved,” Docker harness details, subset rationale, and known limitations.
- `docs/_config.yml` is configured with theme `minima` and baseurl `/benchmarks`.
- `publish.py` uses only the Python stdlib (no external dependencies) and is approximately 200 lines.
- `publish.py` prints a summary after execution and reminds the user to commit and push.