The Learning Forge

Frontier AI papers and talks turned into falsifiable claim cards, each carrying its own hashed evidence.

v0 · 2026-06-30 · grounded in a sealed 16-item corpus · six cards fully grounded, the remaining gaps marked ungrounded on purpose

What this is

The Learning Forge turns frontier AI talks and papers into verified learning objects. A learning object is not a summary. It is a small unit that carries its own evidence: a concept, the source it came from with a hash, a plain-language intuition, a minimal demo, an honest failure case, a matched current source, and a falsifiable claim with a verdict slot that a separate judge can later fill with MATCH, DRIFT, or UNVERIFIABLE.

An earlier research map proposed this Forge and its ten modules, but that map was built without the sources. It could not fetch the papers and talks, so its claim cards rested on memory rather than evidence. This version grounds that map. The sources are now gathered, hashed, and sealed, so each claim card cites a real item that anyone can re-verify against the corpus. The honest result: six cards are fully grounded, five of ten modules are solidly evidence-backed, and the gaps are named rather than papered over.

The corpus, sealed

Twenty items were gathered, four were dropped, and sixteen were kept, digested, and stored in a content-addressed object store. Each item lives under its own sha256, so the hash printed on a card is also the key that retrieves the item. The two seals below let any reader re-derive the full inventory and confirm nothing was swapped after the fact.

Corpus
a content-addressed object store of 16 items from 15 frontier AI sources (10 arXiv papers, 5 talks; one talk carries both a metadata and a transcript object, which accounts for the sixteenth item)
Digest seal
0f364c2173e27ee1848ffa87eb2c99ac232f9bfcc144f2d7feec6af37cd33399
Run seal
e1ba77514c2f6b3f714ff23c537c2db614f3c580af5f0cabea25b8c19a3d45b4
Counts
gathered 20, dropped 4, kept and digested 16, stored 16
Scope tags
reasoning, agents, evaluation, reproducibility, interpretability, test-time, coding, benchmark, verifier, world-model, ai-for-science

On disk, each item is stored under objects/<first-2-hex>/<rest-of-sha256>, so the hash in a card is the retrieval key. The seals above are the re-verification handle, not a claim of correctness: gathering and citing a source is not the same as adjudicating its claim.
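The path layout above can be sketched in a few lines of Python. This is a minimal illustration, not the Forge's own tooling: the function names (object_path, put, verify) and the root directory are hypothetical, and the only thing taken from the text is the objects/<first-2-hex>/<rest-of-sha256> layout.

```python
import hashlib
import tempfile
from pathlib import Path


def object_path(root: Path, sha256_hex: str) -> Path:
    """Map a sha256 hex digest to <root>/objects/<first-2-hex>/<rest-of-sha256>."""
    digest = sha256_hex.lower()
    if len(digest) != 64 or any(c not in "0123456789abcdef" for c in digest):
        raise ValueError("expected a full 64-character hex sha256")
    return root / "objects" / digest[:2] / digest[2:]


def put(root: Path, content: bytes) -> str:
    """Store content under its own hash and return the hash (the retrieval key)."""
    digest = hashlib.sha256(content).hexdigest()
    path = object_path(root, digest)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(content)
    return digest


def verify(root: Path, sha256_hex: str) -> bool:
    """Re-hash the stored bytes and confirm they still match the key."""
    path = object_path(root, sha256_hex)
    if not path.is_file():
        return False
    return hashlib.sha256(path.read_bytes()).hexdigest() == sha256_hex.lower()


if __name__ == "__main__":
    # Hypothetical demo in a throwaway directory, not the real corpus.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        key = put(root, b"example source bytes")
        print(object_path(root, key).relative_to(root))
        print(verify(root, key))
```

Because the key is derived from the bytes, a swapped or edited object fails verify without any external index. A truncated hash such as the eight-character prefixes printed on the cards is rejected here; resolving a prefix to a full key would need a separate lookup step that this sketch does not attempt.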

The shape of a claim card

Every learning object follows the same schema. Each field does one job: name the concept, point at the hashed source, explain it without jargon, show the smallest thing that demonstrates it, state where it breaks, match it to a more recent source, reduce it to one falsifiable sentence, and leave an empty slot for a separate judge. The two fields that keep the card honest are failure_case and limits: they say out loud what the card does not establish.

id
short slug
concept
one sentence, the thing being taught
source_ref
corpus item id plus sha256 (the retrieval key)
plain_language
intuition with no jargon
minimal_demo
the smallest hand-checkable thing that shows it
failure_case
where it breaks, stated honestly
matched_current
a more recent source that extends or limits it
claim
a falsifiable sentence
crucible_verdict
a slot for a separate judge: status MATCH, DRIFT, or UNVERIFIABLE, plus by and when (empty in this version)
module
one of the 10 modules
limits
what this card does NOT establish

The verdict slot is deliberately unfilled in this version: no judge has run, so status carries the placeholder UNVERIFIABLE and the by and when fields are blank. Grounding a claim in a hashed source is not the same as adjudicating it. The Forge gathers and cites; a separate judgment organ (Crucible) judges, in a separate step. Leaving the slot unfilled is the honest state, not an oversight.
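The schema above can be written down as a small data structure. The following is a sketch under stated assumptions: the field names come from the schema list, but the class names, the Python types, and the validation rules (a module number between 1 and 10, non-empty failure_case and limits) are illustrative choices, not the Forge's actual format.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Status(Enum):
    MATCH = "MATCH"
    DRIFT = "DRIFT"
    UNVERIFIABLE = "UNVERIFIABLE"


@dataclass
class CrucibleVerdict:
    # Placeholder state: no judge has run, so by and when stay unset.
    status: Status = Status.UNVERIFIABLE
    by: Optional[str] = None
    when: Optional[str] = None

    @property
    def adjudicated(self) -> bool:
        return self.by is not None and self.when is not None


@dataclass
class SourceRef:
    item_id: str  # corpus item id, e.g. an arXiv id
    sha256: str   # retrieval key into the object store


@dataclass
class ClaimCard:
    id: str
    concept: str
    source_ref: List[SourceRef]
    plain_language: str
    minimal_demo: str
    failure_case: str
    matched_current: str
    claim: str
    module: int
    limits: str
    crucible_verdict: CrucibleVerdict = field(default_factory=CrucibleVerdict)

    def __post_init__(self) -> None:
        if not 1 <= self.module <= 10:
            raise ValueError("module must be one of the 10 modules (1..10)")
        # Assumed rule: the two honesty fields may never be left blank.
        if not self.failure_case.strip() or not self.limits.strip():
            raise ValueError("failure_case and limits must be stated")
```

Defaulting crucible_verdict to an unadjudicated value means a card cannot be created with a verdict by accident; a judge has to set status, by, and when explicitly in a later step.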

One card, in full

Here is a single learning object as it stands in this version, with every field filled except the verdict, which is empty by design.

C1 · Test-time compute can improve reasoning, within scope
concept
Spending more compute at inference (search against a verifier, or letting reinforcement learning incentivize longer reasoning) can lift accuracy on verifiable tasks without a bigger model, but the gain is bounded by problem difficulty and by rising cost.
source_ref
2408.03314v1 11884e3b... (compute-optimal test-time scaling) and 2501.12948v2 d10f13ca... (DeepSeek-R1, reinforcement-learning-incentivized reasoning).
plain_language
Let the model think harder on a hard question and it often does better, but only if you spend the extra thinking where it actually helps, and the help runs out.
minimal_demo
On a problem set, compare one-shot accuracy to best-of-N and to verifier-guided search at matched compute, then plot accuracy per token.
failure_case
2408.03314 states that the effectiveness of each scaling method varies critically with prompt difficulty, and that compute-optimal allocation beats best-of-N (about 4x more efficient) only when the strategy is matched to that difficulty. DeepSeek-R1's gains are on verifiable tasks (math, coding, STEM); a longer chain of thought does not make the reasoning faithful.
matched_current
2509.19681v1 041e5150... (Calibrated Reasoning) argues test-time strategies are capped by the models' poor self-evaluation and adds a calibrated explanatory verifier; 2504.13171v1 f732d910... (Sleep-time Compute) moves some cost offline (about 5x less test-time compute for equal accuracy when the query is predictable).
claim
More inference compute raises accuracy on verifiable problems, but the marginal accuracy per token falls and depends on difficulty; it is not a free or unbounded lever.
crucible_verdict
{ status: UNVERIFIABLE, by: -, when: - } · not yet adjudicated
limits
Does not establish gains on non-verifiable or open-ended tasks; does not establish faithful reasoning, only better answers.
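The minimal_demo field of this card can be made concrete with a small scoring harness. This is a sketch only: it assumes the model samples, their verifier scores, and their token counts have already been collected, and every name in it (Sample, evaluate, the three selection functions) is hypothetical. It compares one-shot answers, a majority vote over N samples, and a verifier-picked best of N; it does not implement step-level search against a verifier, and it contains no real results.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, NamedTuple, Tuple


@dataclass
class Sample:
    answer: str
    verifier_score: float  # assumed to come from some external verifier
    tokens: int            # tokens spent generating this sample


class Result(NamedTuple):
    accuracy: float
    tokens: int
    accuracy_per_1k_tokens: float


def one_shot(samples: List[Sample]) -> str:
    return samples[0].answer


def majority_vote(samples: List[Sample]) -> str:
    # Ties resolve to the answer that appeared first.
    return Counter(s.answer for s in samples).most_common(1)[0][0]


def verifier_best_of_n(samples: List[Sample]) -> str:
    return max(samples, key=lambda s: s.verifier_score).answer


def evaluate(
    problems: List[Tuple[str, List[Sample]]],
    select: Callable[[List[Sample]], str],
    n: int,
) -> Result:
    """Score a selection rule using the first n samples of each problem."""
    if not problems or n < 1:
        raise ValueError("need at least one problem and n >= 1")
    correct = 0
    tokens = 0
    for gold, samples in problems:
        used = samples[:n]
        correct += int(select(used) == gold)
        tokens += sum(s.tokens for s in used)
    accuracy = correct / len(problems)
    return Result(accuracy, tokens, 1000 * accuracy / tokens)


if __name__ == "__main__":
    # Hypothetical hand-made samples, for shape only.
    toy = [
        ("4", [Sample("4", 0.9, 10), Sample("5", 0.2, 10), Sample("4", 0.8, 10)]),
        ("7", [Sample("6", 0.3, 10), Sample("7", 0.9, 10), Sample("6", 0.4, 10)]),
    ]
    for name, rule, n in [("one-shot", one_shot, 1),
                          ("majority", majority_vote, 3),
                          ("verifier", verifier_best_of_n, 3)]:
        print(name, evaluate(toy, rule, n))
```

Sweeping n and plotting accuracy against tokens gives the curve the card asks for. Reporting accuracy per 1,000 tokens alongside raw accuracy is what keeps "more compute helped" from hiding a falling marginal return.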

The other grounded cards

Five more cards follow the same schema, each tied to a specific hashed item. The summaries below give the concept, the claim, and the limit each card refuses to overstep.

C2 · Coding-benchmark success can be inflated
  • Claim. A benchmark score is evidence of capability only after contamination, retrieval, and the coding-versus-engineering distinction are controlled for.
  • Grounding. SWE-bench (2310.06770v3) reports that at release the best model resolved only 1.96% of 2,294 real issues, so early "coding is solved" framings were not grounded; the Jeremy Howard talk argues current models are weak at software engineering and produce near-copies of existing work.
  • Limits. Does not claim all coding-agent results are contaminated; claims the default reading of a raw score overstates capability.
C3 · Evaluation is a scientific object, not a leaderboard number
  • Claim. A model evaluation reports a distribution and a reliability level against a defined task distribution; a single accuracy number without those is not a scientific result.
  • Grounding. The Beth Barnes and David Rein talk on time-horizon measurement reports that the dominant uncertainty is generalization to the real world, not the standard error, and that a regularization bug shifted a reported 50% horizon by about 35%.
  • Limits. Does not provide a single correct reliability threshold; the right threshold depends on the question being asked.
C4 · AI-for-science needs measurement-first claim cards
  • Claim. An AI-for-science result is credible when it names the experiment it predicts, reports validity against that experiment, and bounds its scope.
  • Grounding. The John Jumper talk is explicit that AlphaFold predicts one narrow category of measurement rather than modeling the cell; the Robert Lange talk shows autonomous discovery often stalls without a reframed surrogate problem.
  • Limits. Does not claim AlphaFold-level validity generalizes to messier biology; Jumper himself says most of biology is still unknown.
C5 · Interpretability gives explanations, not proofs
  • Claim. Interpretability output is an explanation with stated coverage, not a proof of behavior; treat it as evidence, not a verdict.
  • Grounding. 2402.03855v2 argues most mechanistic-interpretability work so far studies trivial, token-aligned behaviors and that established methods are insufficient for hidden representations; later tooling (Equivariant SAEs, Prisma) extends the field while showing its own assumptions still move.
  • Limits. Does not claim interpretability is unusable; claims it is partial and must be labeled as such.
C6 · A model builds abstractions but extrapolates poorly
  • Claim. Pretraining yields useful internal abstractions and strong interpolation; it does not yield reliable out-of-distribution extrapolation.
  • Grounding. The Howard talk frames pretraining as compression into a hierarchy of abstractions that interpolates well but degrades sharply outside the data; the Michael Jordan talk pushes back on the word "understanding" and frames intelligence as collective. Both are cited and the tension is left standing.
  • Limits. The interpolation surface is vast and its true limits are unknown; this card states a direction, not a hard boundary.

Grounded versus ungrounded, counted honestly

The earlier map's ten modules are kept exactly as-is, so the grounding is auditable against the original. Five of ten modules are solidly evidence-backed by at least one specific, on-point item in the sealed corpus. Two are partial: covered only through a single talk, still wanting a primary reference. Three have nothing on-point and are marked as needing sources gathered next. Naming the empty modules is the point, not a footnote.

Grounded (5)
(4) reasoning and test-time compute · (6) coding agents and SWE benchmarks · (7) evaluation and reproducibility · (8) interpretability · (10) AI-for-science
Partial (2)
(2) what a model is · (3) what an LLM is. Both are grounded only through the Howard talk's compression-to-abstractions argument; each would benefit from a primary architecture or pretraining paper.
Not covered (3)
(1) programming-as-evidence · (5) agents, tools, and MCP · (9) efficient and alternative compute. The corpus has little or nothing on-point; these are the next things to gather.

Net: 5 of 10 modules are solidly evidence-backed, 2 are partial, and 3 need gathering. Six claim cards are grounded; none are adjudicated. Every verdict slot reads UNVERIFIABLE because judging is a separate organ and a separate step that has not run yet.
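The accounting above is small enough to check mechanically. The sketch below encodes the ten modules with the statuses listed in this section and recounts them; the dictionary layout and the status labels ("grounded", "partial", "not_covered") are illustrative choices, not the Forge's own format.

```python
from collections import Counter

# Module number -> (name, grounding status), as listed in this section.
MODULES = {
    1: ("programming-as-evidence", "not_covered"),
    2: ("what a model is", "partial"),
    3: ("what an LLM is", "partial"),
    4: ("reasoning and test-time compute", "grounded"),
    5: ("agents, tools, and MCP", "not_covered"),
    6: ("coding agents and SWE benchmarks", "grounded"),
    7: ("evaluation and reproducibility", "grounded"),
    8: ("interpretability", "grounded"),
    9: ("efficient and alternative compute", "not_covered"),
    10: ("AI-for-science", "grounded"),
}

tally = Counter(status for _name, status in MODULES.values())

# The three counts must cover all ten modules, with none counted twice.
assert sum(tally.values()) == len(MODULES) == 10
print(dict(tally))
```

Keeping the tally derived from the per-module table, rather than typed in as three separate numbers, means a module that changes status later updates the headline count automatically.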

First three labs

Each lab is tied to a gathered, hashed source and produces an artifact a judge can later mark MATCH, DRIFT, or UNVERIFIABLE. The success criterion is explicit in every case: "it runs" is never the bar, and a result that contradicts the card is itself a recorded outcome.

L1 · Accuracy-per-token test-time-compute lab
  • On a small verifiable task set, run one-shot, best-of-N, and a simple verifier-guided selection, then plot accuracy versus tokens spent.
  • Success criterion. The curve shows diminishing accuracy per token, and verifier-guided selection beats best-of-N at a matched token budget on the harder slice. If it does not, the C1 claim drifts.
L2 · Benchmark-contamination coding lab
  • Split coding tasks by likely-seen versus likely-unseen relative to a training cutoff, measure resolution rate on each, and inspect solutions for copied structure.
  • Success criterion. A measurable gap between seen and unseen resolution rates, with at least one concrete example of retrieval rather than engineering. A zero gap is itself a recorded result against C2.
L3 · Explanation-is-not-proof interpretability lab
  • Surface an apparently interpretable feature, then construct an input where that feature's reading fails to predict the model's behavior.
  • Success criterion. At least one feature that looks interpretable and at least one case where it mispredicts behavior, both logged. This demonstrates C5 rather than asserting it.
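The measurement half of lab L2 can be sketched as follows. This is a minimal illustration under two loud assumptions: that a task's creation date relative to the training cutoff is an acceptable proxy for likely-seen versus likely-unseen, and that resolution is recorded as a simple boolean per task. The names TaskResult and seen_unseen_gap are hypothetical, and the inspection of solutions for copied structure is not covered.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List


@dataclass(frozen=True)
class TaskResult:
    task_id: str
    created: date   # when the task or issue was published (assumed proxy)
    resolved: bool  # whether the model's solution passed the task's checks


def resolution_rate(results: List[TaskResult]) -> float:
    if not results:
        raise ValueError("cannot compute a rate on an empty split")
    return sum(r.resolved for r in results) / len(results)


def seen_unseen_gap(results: List[TaskResult], cutoff: date) -> Dict[str, float]:
    """Split by the training cutoff and compare resolution rates."""
    seen = [r for r in results if r.created <= cutoff]
    unseen = [r for r in results if r.created > cutoff]
    seen_rate = resolution_rate(seen)
    unseen_rate = resolution_rate(unseen)
    return {
        "n_seen": len(seen),
        "n_unseen": len(unseen),
        "seen_rate": seen_rate,
        "unseen_rate": unseen_rate,
        "gap": seen_rate - unseen_rate,
    }


if __name__ == "__main__":
    # Hypothetical records, for shape only.
    toy = [
        TaskResult("a", date(2023, 1, 10), True),
        TaskResult("b", date(2023, 3, 2), True),
        TaskResult("c", date(2024, 5, 20), True),
        TaskResult("d", date(2024, 6, 1), False),
    ]
    print(seen_unseen_gap(toy, cutoff=date(2023, 12, 31)))
```

A gap near zero is returned like any other value, which matches the lab's rule that a zero gap is a recorded result against C2. On small splits the gap should be read alongside n_seen and n_unseen, since a difference of one task can move the rate substantially.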

On honesty and method

Every claim card cites a sha256 that resolves to a real object in the sealed corpus, and the two seals let any reader re-derive the inventory. Verdict slots are empty on purpose: gathering and citing is done, judging is a separate organ and a separate step. The earlier map's ten modules are kept as-is so the grounding is auditable against it, and the accounting states plainly which modules the evidence does and does not reach. No card asserts more than its read source supports, and where two sources disagree, both are cited and the tension is left standing rather than resolved. That is the whole discipline: carry a re-checkable proof, never ask to be trusted.

Updated 2026-06-30. v0, grounded in a now-verified corpus.

Public showcase. No secrets, no private repository internals, and no claim beyond what the read sources support. The corpus seals and per-item hashes are the re-verification handle.