- concept
- Spending more compute at inference (search against a verifier, or letting reinforcement learning incentivize longer reasoning) can lift accuracy on verifiable tasks without a bigger model, but the gain is bounded by problem difficulty and by rising cost.
- source_ref
- 2408.03314v1 11884e3b... (compute-optimal test-time scaling) and 2501.12948v2 d10f13ca... (DeepSeek-R1, reinforcement-learning-incentivized reasoning).
- plain_language
- Let the model think harder on a hard question and it often does better, but only if you spend the extra thinking where it actually helps, and the help runs out.
- minimal_demo
- On a problem set, compare one-shot accuracy to best-of-N and to verifier-guided search at matched compute, then plot accuracy per token.
- failure_case
- 2408.03314 states that the effectiveness of each scaling method varies critically with prompt difficulty, and that compute-optimal allocation beats best-of-N (by about 4x in efficiency) only when the strategy is matched to that difficulty. DeepSeek-R1's gains are on verifiable tasks (math, coding, STEM); a longer chain of thought does not make the reasoning faithful.
- matched_current
- 2509.19681v1 041e5150... (Calibrated Reasoning) argues test-time strategies are capped by the models' poor self-evaluation and adds a calibrated explanatory verifier; 2504.13171v1 f732d910... (Sleep-time Compute) moves some cost offline (about 5x less test-time compute for equal accuracy when the query is predictable).
- claim
- More inference compute raises accuracy on verifiable problems, but the marginal accuracy per token falls and depends on difficulty; it is not a free or unbounded lever.
- crucible_verdict
- { status: UNVERIFIABLE, by: -, when: - } · not yet adjudicated
- limits
- Does not establish gains on non-verifiable or open-ended tasks; does not establish faithful reasoning, only better answers.
The Learning Forge
Frontier AI papers and talks turned into falsifiable claim cards, each carrying its own hashed evidence.
v0 · 2026-06-30 · grounded in a sealed 16-item corpus / some cards fully grounded, the rest marked ungrounded on purpose
What this is
The Learning Forge turns frontier AI talks and papers into verified learning objects. A learning object is not a summary. It is a small unit that carries its own evidence: a concept, the source it came from with a hash, a plain-language intuition, a minimal demo, an honest failure case, a matched current source, and a falsifiable claim with a verdict slot that a separate judge can later fill MATCH, DRIFT, or UNVERIFIABLE.
An earlier research map proposed this Forge and its ten modules, but that map was built without the sources. It could not fetch the papers and talks, so its claim cards rested on memory rather than evidence. This version grounds that map. The sources are now gathered, hashed, and sealed, so each claim card cites a real item that anyone can re-verify against the corpus. The honest result: six cards are fully grounded, five of ten modules are solidly evidence-backed, and the gaps are named rather than papered over.
The corpus, sealed
Twenty items were gathered, four were dropped, and sixteen were kept, digested, and stored in a content-addressed object store. Each item lives under its own sha256, so the hash printed on a card is also the key that retrieves the item. The two seals below let any reader re-derive the full inventory and confirm nothing was swapped after the fact.
- Corpus
- a content-addressed object store holding 16 objects from 15 frontier AI sources (10 arXiv papers and 5 talks; one talk carries both a metadata object and a transcript object)
- Digest seal
- 0f364c2173e27ee1848ffa87eb2c99ac232f9bfcc144f2d7feec6af37cd33399
- Run seal
- e1ba77514c2f6b3f714ff23c537c2db614f3c580af5f0cabea25b8c19a3d45b4
- Counts
- gathered 20, dropped 4, kept and digested 16, stored 16
- Scope tags
- reasoning, agents, evaluation, reproducibility, interpretability, test-time, coding, benchmark, verifier, world-model, ai-for-science
Each item is stored under objects/<first-2-hex>/<rest-of-sha256>, so the hash in a card is the retrieval key. The seals above are the re-verification handle, not a claim of correctness. Gathering and citing a source is not the same as adjudicating its claim.
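The layout above can be sketched in a few lines. This is a minimal illustration, not the Forge's actual tooling: the function names `object_path`, `store`, and `verify` are hypothetical, and it assumes the sha256 is taken over the raw bytes of the stored item, which the text does not state. It also needs the full 64-character digest, not the shortened form printed on the cards.

```python
import hashlib
from pathlib import Path


def object_path(root: Path, sha256_hex: str) -> Path:
    """Map a digest to <root>/objects/<first-2-hex>/<rest-of-sha256>."""
    if len(sha256_hex) != 64 or any(c not in "0123456789abcdef" for c in sha256_hex):
        raise ValueError("expected a 64-character lowercase hex sha256")
    return root / "objects" / sha256_hex[:2] / sha256_hex[2:]


def store(root: Path, data: bytes) -> str:
    """Write data under its own digest and return that digest.

    Assumption: the digest covers the raw bytes exactly as stored.
    """
    digest = hashlib.sha256(data).hexdigest()
    path = object_path(root, digest)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    return digest


def verify(root: Path, sha256_hex: str) -> bool:
    """Re-hash the stored object and compare it with its retrieval key."""
    path = object_path(root, sha256_hex)
    if not path.is_file():
        return False
    return hashlib.sha256(path.read_bytes()).hexdigest() == sha256_hex
```

A `True` from `verify` only shows that the bytes still match the key. It says nothing about whether the item's claim is correct, which is the same distinction the paragraph above draws.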
The shape of a claim card
Every learning object follows the same schema. Each field does one job: name the concept, point at the hashed source, explain it without jargon, show the smallest thing that demonstrates it, state where it breaks, match it to a more recent source, reduce it to one falsifiable sentence, and leave an empty slot for a separate judge. The two fields that keep the card honest are failure_case and limits: they say out loud what the card does not establish.
- id
- short slug
- concept
- one sentence, the thing being taught
- source_ref
- corpus item id plus sha256 (the retrieval key)
- plain_language
- intuition with no jargon
- minimal_demo
- the smallest hand-checkable thing that shows it
- failure_case
- where it breaks, stated honestly
- matched_current
- a more recent source that extends or limits it
- claim
- a falsifiable sentence
- crucible_verdict
- a slot for a separate judge: status MATCH, DRIFT, or UNVERIFIABLE, plus by and when (empty in this version)
- module
- one of the 10 modules
- limits
- what this card does NOT establish
The verdict slot is deliberately unfilled in this version: each card carries the placeholder status UNVERIFIABLE with no judge and no date. Grounding a claim in a hashed source is not the same as adjudicating it. The Forge gathers and cites; a separate judgment organ (Crucible) judges, in a separate step. Leaving the slot unfilled is the honest state, not an oversight.
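The schema can be written down as a typed record so that an unfilled field is caught mechanically. The sketch below is one possible encoding, not the Forge's own data model: the class names `Status`, `Verdict`, and `ClaimCard` and the `problems` helper are hypothetical. Field names follow the list above, and `module` is assumed to be an integer from 1 to 10.

```python
from dataclasses import dataclass, field, fields
from enum import Enum
from typing import List, Optional


class Status(Enum):
    MATCH = "MATCH"
    DRIFT = "DRIFT"
    UNVERIFIABLE = "UNVERIFIABLE"


@dataclass
class Verdict:
    # Default state: not yet adjudicated, no judge and no date recorded.
    status: Status = Status.UNVERIFIABLE
    by: Optional[str] = None
    when: Optional[str] = None

    @property
    def adjudicated(self) -> bool:
        return self.by is not None and self.when is not None


@dataclass
class ClaimCard:
    id: str
    concept: str
    source_ref: str
    plain_language: str
    minimal_demo: str
    failure_case: str
    matched_current: str
    claim: str
    module: int  # assumption: modules are numbered 1 to 10
    limits: str
    crucible_verdict: Verdict = field(default_factory=Verdict)

    def problems(self) -> List[str]:
        """List schema violations; an empty list means the card is well-formed."""
        issues = []
        for f in fields(self):
            value = getattr(self, f.name)
            if isinstance(value, str) and not value.strip():
                issues.append(f"{f.name} is empty")
        if not 1 <= self.module <= 10:
            issues.append("module must be between 1 and 10")
        return issues
```

Keeping `failure_case` and `limits` as required fields, with no default, means a card cannot be constructed without them, which matches the role the text gives those two fields.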
One card, in full
Here is a single learning object as it stands in this version, with every field filled except the verdict, which is empty by design.
The other grounded cards
Five more cards follow the same schema, each tied to a specific hashed item. The summaries below give the concept, the claim, and the limit each card refuses to overstep.
- Claim. A benchmark score is evidence of capability only after contamination, retrieval, and the coding-versus-engineering distinction are controlled for.
- Grounding. SWE-bench (2310.06770v3) reports that at release the best model resolved only 1.96% of 2,294 real issues, so early "coding is solved" framings were not grounded; the Jeremy Howard talk argues current models are weak at software engineering and produce near-copies of existing work.
- Limits. Does not claim all coding-agent results are contaminated; claims the default reading of a raw score overstates capability.
- Claim. A model evaluation reports a distribution and a reliability level against a defined task distribution; a single accuracy number without those is not a scientific result.
- Grounding. The Beth Barnes and David Rein talk on time-horizon measurement argues that the dominant uncertainty is generalization to the real world, not the standard error, and notes that a regularization bug shifted a reported 50% horizon by about 35%.
- Limits. Does not provide a single correct reliability threshold; the right threshold depends on the question being asked.
- Claim. An AI-for-science result is credible when it names the experiment it predicts, reports validity against that experiment, and bounds its scope.
- Grounding. The John Jumper talk is explicit that AlphaFold predicts one narrow category of measurement rather than modeling the cell; the Robert Lange talk shows autonomous discovery often stalls without a reframed surrogate problem.
- Limits. Does not claim AlphaFold-level validity generalizes to messier biology; Jumper himself says most of biology is still unknown.
- Claim. Interpretability output is an explanation with stated coverage, not a proof of behavior; treat it as evidence, not a verdict.
- Grounding. 2402.03855v2 argues most mechanistic-interpretability work so far studies trivial, token-aligned behaviors and that established methods are insufficient for hidden representations; later tooling (Equivariant SAEs, Prisma) extends the field while showing its own assumptions still move.
- Limits. Does not claim interpretability is unusable; claims it is partial and must be labeled as such.
- Claim. Pretraining yields useful internal abstractions and strong interpolation; it does not yield reliable out-of-distribution extrapolation.
- Grounding. The Howard talk frames pretraining as compression into a hierarchy of abstractions that interpolates well but degrades sharply outside the data; the Michael Jordan talk pushes back on the word "understanding" and frames intelligence as collective. Both are cited and the tension is left standing.
- Limits. The interpolation surface is vast and its true limits are unknown; this card states a direction, not a hard boundary.
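The evaluation card above asks for an interval rather than a bare accuracy number. One conventional way to attach one to a pass rate is the Wilson score interval, sketched here. The choice of Wilson is an assumption of this example, not something the cited talk prescribes, and the function name `wilson_interval` is hypothetical. It covers sampling error only; the generalization uncertainty that the same card calls dominant is outside what any such interval can express.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, center - half), min(1.0, center + half)
```

As a usage example, the SWE-bench figure quoted above (1.96% of 2,294 issues) corresponds to roughly 45 resolved issues. That count is back-computed here, not quoted from the paper, and `wilson_interval(45, 2294)` would give the sampling band around it.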
Grounded versus ungrounded, counted honestly
The earlier map's ten modules are kept exactly as-is, so the grounding is auditable against the original. Five of ten modules are solidly evidence-backed by at least one specific, on-point item in the sealed corpus. Two are partial: covered only through a single talk, still wanting a primary reference. Three have nothing on-point and are marked as needing sources gathered next. Naming the empty modules is the point, not a footnote.
- Grounded (5)
- (4) reasoning and test-time compute · (6) coding agents and SWE benchmarks · (7) evaluation and reproducibility · (8) interpretability · (10) AI-for-science
- Partial (2)
- (2) what a model is · (3) what an LLM is. Both are grounded only through the Howard talk's compression-to-abstractions argument; each would benefit from a primary architecture or pretraining paper.
- Not covered (3)
- (1) programming-as-evidence · (5) agents, tools, and MCP · (9) efficient and alternative compute. The corpus has little or nothing on-point; these are the next things to gather.
Net: 5 of 10 modules are solidly evidence-backed, 2 are partial, and 3 need gathering. Six claim cards are grounded; none are adjudicated. Every verdict slot reads UNVERIFIABLE because judging is a separate organ and a separate step that has not run yet.
First three labs
Each lab is tied to a gathered, hashed source and produces an artifact a judge can later mark MATCH, DRIFT, or UNVERIFIABLE. The success criterion is explicit in every case: "it runs" is never the bar, and a result that contradicts the card is itself a recorded outcome.
- On a small verifiable task set, run one-shot, best-of-N, and a simple verifier-guided selection, then plot accuracy versus tokens spent.
- Success criterion. The curve shows diminishing accuracy per token, and verifier-guided selection beats best-of-N at a matched token budget on the harder slice. If it does not, the C1 claim drifts.
- Split coding tasks by likely-seen versus likely-unseen relative to a training cutoff, measure resolution rate on each, and inspect solutions for copied structure.
- Success criterion. A measurable gap between seen and unseen resolution rates, with at least one concrete example of retrieval rather than engineering. A zero gap is itself a recorded result against C2.
- Surface an apparently interpretable feature, then construct an input where that feature's reading fails to predict the model's behavior.
- Success criterion. At least one feature that looks interpretable and at least one case where it mispredicts behavior, both logged. This demonstrates C5 rather than asserting it.
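For the first lab, the success criterion reduces to two computations: accuracy and mean tokens per strategy, and the accuracy gained per extra token between budget levels. The sketch below shows that bookkeeping only. The record shape `(strategy, tokens, correct)` and the function names are assumptions of this example, and running the models and the verifier is out of scope.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# One record per (strategy, problem) run: strategy name, tokens spent, correct?
Run = Tuple[str, int, bool]


def summarize(runs: Iterable[Run]) -> Dict[str, Tuple[float, float]]:
    """Return {strategy: (accuracy, mean_tokens_per_problem)}."""
    totals = defaultdict(lambda: [0, 0, 0])  # correct, tokens, count
    for strategy, tokens, correct in runs:
        entry = totals[strategy]
        entry[0] += int(correct)
        entry[1] += tokens
        entry[2] += 1
    return {s: (c / n, tok / n) for s, (c, tok, n) in totals.items()}


def marginal_gains(points: List[Tuple[float, float]]) -> List[float]:
    """Accuracy gained per extra token between consecutive budget levels.

    `points` holds (mean_tokens, accuracy) pairs, one per budget level.
    """
    ordered = sorted(points)
    return [
        (a1 - a0) / (t1 - t0)
        for (t0, a0), (t1, a1) in zip(ordered, ordered[1:])
        if t1 > t0
    ]


def is_diminishing(gains: List[float]) -> bool:
    """True when each step buys no more accuracy per token than the one before."""
    return all(later <= earlier for earlier, later in zip(gains, gains[1:]))
```

Comparing `summarize` output for verifier-guided selection and best-of-N at a matched token budget, on the harder slice only, gives the second half of the criterion. If `is_diminishing` comes back `False` on real runs, that is a recorded result against the card, not a bug to be tuned away.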
On honesty and method
Every claim card cites a sha256 that resolves to a real object in the sealed corpus, and the two seals let any reader re-derive the inventory. Verdict slots are empty on purpose: gathering and citing is done, judging is a separate organ and a separate step. The earlier map's ten modules are kept as-is so the grounding is auditable against it, and the accounting states plainly which modules the evidence does and does not reach. No card asserts more than its read source supports, and where two sources disagree, both are cited and the tension is left standing rather than resolved. That is the whole discipline: carry a re-checkable proof, never ask to be trusted.
Updated 2026-06-30. v0, grounded in a now-verified corpus.
Public showcase. No secrets, no private repository internals, and no claim beyond what the read sources support. The corpus seals and per-item hashes are the re-verification handle.