Publish SWE-bench Results via GitHub Pages

Problem / Motivation

eforge benchmarks now produce real results (2/5 resolved on the latest run), but there is no way to publish or track them over time. There is no public-facing site showing methodology, latest results, or historical runs. Without a publishing mechanism, benchmark progress is invisible and untrackable.

Goal

Ship a simple GitHub Pages site that displays methodology, latest SWE-bench results, and historical runs — updated via a manual publish script after each benchmark run.

Approach

Jekyll site in `docs/` on `main` branch

GitHub Pages serves directly from docs/ — no build step, no separate branch required.

docs/
  _config.yml                    # Jekyll config (theme: minima)
  _data/
    runs.json                    # Append-only historical run data
  index.md                       # Homepage: latest results + history table
  methodology.md                 # Static: how eforge works, benchmark approach
  results/
    index.md                     # All runs table (regenerated)
    <timestamp>.md               # Per-run detail page (generated)

`publish.py` — manual publish script (~200 lines, stdlib only)

Usage: python publish.py results/2026-03-28T03-05-38/ [--notes "description"]

Steps:

Load data from 3 sources:
- results/<timestamp>/config.json — run config (instances, timeout, dataset)
- results/<timestamp>/eforge_metadata.jsonl — per-instance timing, exit codes, failure reasons
- eforge.eforge_predictions.json (repo root) — SWE-bench eval report (resolved/unresolved IDs)
Detect eforge version via npm list -g eforge --json
Build run summary and append to docs/_data/runs.json (duplicate check by timestamp)
Generate docs/results/<timestamp>.md — per-run detail page
Regenerate docs/results/index.md — all runs table
Regenerate docs/index.md — homepage with latest results
Print summary, remind user to commit + push (no auto-commit)

Data per run in runs.json:

timestamp, date, dataset, eforge_version
num_instances, num_resolved, resolution_rate
resolved_ids, unresolved_ids, empty_patch_ids
per_instance: [{instance_id, status, failure_reason, duration_seconds, patch_lines}]
notes (optional)

`docs/methodology.md` — static page

Written once, maintained by hand. Covers:

What eforge is (multi-agent pipeline: planner, builder, reviewer)
SWE-bench Lite and what “resolved” means
Docker harness: how instances are run, PRD generation, patch extraction
Why runs use different instance subsets
Known limitations (cost, timeout, contamination)

Files to create

File	Type	Description
`docs/_config.yml`	Create	Jekyll config (minima theme, baseurl: `/benchmarks`)
`docs/index.md`	Generate	Homepage with latest results + history
`docs/methodology.md`	Create	Static methodology explanation
`docs/results/index.md`	Generate	All runs table
`docs/_data/runs.json`	Create	Empty `[]`, appended by `publish.py`
`publish.py`	Create	~200 lines, stdlib only

Post-implementation steps

Run python publish.py results/2026-03-28T03-05-38/ --notes "First published run"
Review generated docs, commit + push
Enable GitHub Pages in repo settings: Source → main branch → /docs folder

Scope

In scope:

Jekyll site structure in docs/ served via GitHub Pages from main branch
publish.py script that reads benchmark output, generates/updates all site pages, and appends to runs.json
Static methodology.md page covering eforge architecture, SWE-bench Lite, Docker harness, subset rationale, and known limitations
Homepage (index.md) showing latest results and history table
Per-run detail pages (results/<timestamp>.md) with per-instance tables
All-runs index page (results/index.md)
Append-only runs.json with duplicate-check by timestamp
eforge version detection via npm list -g eforge --json

Out of scope:

Automatic commits or pushes (script prints a reminder only)
CI/CD or automated publishing pipeline
Separate branch or external build step for GitHub Pages
Any dependencies beyond Python stdlib for publish.py

Acceptance Criteria

Running python publish.py results/2026-03-28T03-05-38/ produces:
- docs/_data/runs.json containing exactly 1 entry with correct data (timestamp, resolved/unresolved IDs, resolution rate, per-instance details)
- docs/results/2026-03-28T03-05-38.md exists and contains a per-instance table
- docs/index.md displays the latest results and a history table
- docs/results/index.md displays an all-runs table
Running publish.py a second time on the same timestamp is rejected (duplicate check).
The generated markdown pages render correctly when served by Jekyll (verified via bundle exec jekyll serve or manual review).
docs/methodology.md exists as a static page covering: eforge multi-agent pipeline, SWE-bench Lite definition of “resolved,” Docker harness details, subset rationale, and known limitations.
docs/_config.yml is configured with theme minima and baseurl /benchmarks.
publish.py uses only Python stdlib (no external dependencies) and is approximately 200 lines.
publish.py prints a summary after execution and reminds the user to commit and push.