It brings information in, from anywhere, and records how.
gather is the research-intake organ of the constellation. It reaches information from scattered, often awkward sources and returns it as items, each carrying a provenance receipt. It is the one place where network access, third-party tools, and credentials are allowed to live, isolated behind source adapters, so the rest of the constellation stays clean, offline, and deterministic. A run scope-filters what it gathered, folds the receipts into one witnessed digest, and hands that digest downstream to index, refine, and the crucible.
Every kind of intake is one small adapter behind a single shape: one string in, a list of receipted items out. The adapter can use the network, a tool, a credential, a headless browser, whatever the source demands; nothing else in gather imports any of that. Awkward access is an adapter problem, not a system problem, and the core stays pure standard library because of it.
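The single shape described above can be sketched as a small protocol. This is a minimal illustration, not gather's actual API: the names `Item`, `Source`, `fetch`, `LocalFileSource`, and the `file-read` method label are all assumptions made for the example.

```python
import hashlib
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import List, Protocol


@dataclass(frozen=True)
class Item:
    """Hypothetical item: content plus a minimal receipt."""
    content: str
    source: str      # which adapter produced it
    reference: str   # the string the adapter was given
    method: str      # how the content was obtained
    sha256: str      # fingerprint of the content


class Source(Protocol):
    """The one shape: one string in, a list of receipted items out."""
    def fetch(self, reference: str) -> List[Item]: ...


class LocalFileSource:
    """A cheap local adapter: one item per non-empty line of a text file."""
    def fetch(self, reference: str) -> List[Item]:
        text = Path(reference).read_text(encoding="utf-8")
        return [
            Item(
                content=line,
                source="local-file",
                reference=reference,
                method="file-read",  # illustrative label, not one of gather's
                sha256=hashlib.sha256(line.encode("utf-8")).hexdigest(),
            )
            for line in text.splitlines()
            if line.strip()
        ]


def gather_from(source: Source, reference: str) -> List[Item]:
    # The core sees only the Source shape, never an adapter's own imports.
    return source.fetch(reference)


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "notes.txt"
    path.write_text("first fact\n\nsecond fact\n", encoding="utf-8")
    items = gather_from(LocalFileSource(), str(path))

print([(item.method, item.content) for item in items])
```

An adapter that needs a browser or an external tool would keep those imports inside its own module; `gather_from` would not change.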
The receipt is the rule. Each item records how it was obtained, so a transcript read from captions, text recognized from a scan, and a fact synthesized from fragments are all valid items, but they are not equal, and they are never confused.
Reach anywhere. Then say exactly how.
gather ships a source adapter for each kind of intake, all behind the one shape. The cheap, local sources are pure standard library; the harder ones shell out to their own external tool, never a Python dependency. Each adapter records the method it used, so the accountability is in place before the harder reach is trusted. The method on every item is the load-bearing field: it keeps a direct read and a derived inference from ever being read as the same thing.
yt-dlp; pure parsing behind an impure shell, each item receipted
http-get, exactly that
pdftotext, and from a scanned image via tesseract; a best-effort machine reading, labelled ocr, never dressed up as clean text
transcribe so a listener knows it was machine-made
browser-extract, so you know JS ran. The most exposed edge, see the limit below
The honesty matters most exactly where the reach is widest. The web adapter runs no JavaScript, so it is honest that a client-rendered page gives back only its shell. The browser adapter runs a real headless browser, and it is the most exposed organ in the system: its host guard covers only the first navigation, and the rendered page then follows its own redirects and sub-requests unguarded. So do not point it at untrusted URLs in an environment where internal services are reachable. The browser-extract method is honest that JavaScript ran, but it cannot make the fetch safe. The threat model says so plainly, in the open, rather than hiding it.
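To make the limit concrete, here is a rough sketch of what a first-navigation host guard might look like. It is an assumption for illustration only: the function name `host_is_internal` and its refusal policy are hypothetical and are not taken from gather's code.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def host_is_internal(url: str) -> bool:
    """True if the URL's host resolves to a loopback, private, or link-local address."""
    host = urlsplit(url).hostname
    if host is None:
        return True  # nothing to check, so refuse
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return True  # unresolvable, so refuse rather than guess
    for info in infos:
        # Drop any IPv6 zone suffix such as "%eth0" before parsing.
        address = ipaddress.ip_address(info[4][0].split("%")[0])
        if address.is_loopback or address.is_private or address.is_link_local:
            return True
    return False


print(host_is_internal("http://127.0.0.1:8080/admin"))
print(host_is_internal("http://169.254.169.254/latest/meta-data"))
```

A check like this sees only the URL it is handed. Once a real browser loads the page, redirects and sub-requests are issued by the page itself and never pass through the function, which is the gap the paragraph above describes.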
The receipt is the differentiator.
A provenance receipt rides on every item: the source, the reference, the method, the time, and a sha256 of the item’s own content. Re-hash the content and you can confirm it is what was obtained, unaltered. The method keeps the harder distinction on the record (yt-dlp, browser-extract, ocr, transcribe, synthesized), so a quote pulled straight from a source and an inference pieced together from fragments are both valid items, but they are never equally direct, and the receipt never lets one be read as the other.
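The re-hash check can be sketched in a few lines. The five fields come from the description above; the `Receipt` class, `make_receipt`, and `content_matches` are hypothetical names, and the example URL and timestamp are placeholders.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Receipt:
    source: str
    reference: str
    method: str
    time: str
    sha256: str  # fingerprint of the item's own content


def make_receipt(content: str, source: str, reference: str,
                 method: str, time: str) -> Receipt:
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    return Receipt(source, reference, method, time, digest)


def content_matches(content: str, receipt: Receipt) -> bool:
    # Re-hash the content and compare it with what the receipt recorded.
    return hashlib.sha256(content.encode("utf-8")).hexdigest() == receipt.sha256


receipt = make_receipt(
    "The quoted sentence.",
    source="web",
    reference="https://example.org/page",
    method="http-get",
    time="2024-01-01T00:00:00+00:00",
)
print(content_matches("The quoted sentence.", receipt))  # True
print(content_matches("The quoted sentence!", receipt))  # False
```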
A fact pieced together is not a fact pulled straight from the source. The receipt keeps them apart.
A derived item, one assembled or inferred from other items rather than fetched, is the sharp case, and the receipt is built for it. Its sha256 fingerprints the inference itself, not its sources, because it is a new statement that can only witness itself; a derived_from field records the content hash of each input, a re-checkable pointer back to the exact source content. The honesty is mechanical where it can be: the synthesized label is reachable only through one seam, and the bare builder refuses to stamp it and defaults to compiled, so a plain call can never forge a synthesis. With no model wired in, the default compiles the inputs verbatim and invents nothing. The digest seal folds in the method and the inputs alongside the hash, so relabelling an inference as a direct fetch, or quietly rewriting what it was built from, breaks the seal exactly as altering the content does.
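A small sketch may help show why relabelling breaks the seal. It assumes receipts are plain dictionaries and that the seal is a sha256 over a canonical JSON encoding; `build_derived`, `seal`, and the choice to raise an error (rather than silently fall back to compiled) are assumptions for illustration, not gather's actual behaviour.

```python
import hashlib
import json


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def build_derived(content: str, inputs: list, method: str = "compiled") -> dict:
    """Bare builder: stamps 'compiled' by default and refuses 'synthesized'."""
    if method == "synthesized":
        raise ValueError("synthesized is reserved for the synthesizer seam")
    return {
        "content": content,
        "method": method,
        "sha256": sha256_hex(content),  # witnesses the new statement itself
        "derived_from": [sha256_hex(text) for text in inputs],  # pointers to inputs
    }


def seal(receipts: list) -> str:
    """Fold content hash, method, and inputs of every receipt into one digest."""
    canonical = json.dumps(
        [[r["sha256"], r["method"], r.get("derived_from", [])] for r in receipts],
        separators=(",", ":"),
    )
    return sha256_hex(canonical)


inputs = ["fragment one", "fragment two"]
derived = build_derived("fragment one\nfragment two", inputs)
original_seal = seal([derived])

# Relabel the inference as a direct fetch, leaving the content untouched.
relabelled = dict(derived, method="http-get")
# Quietly rewrite what the item claims it was built from.
rewired = dict(derived, derived_from=[sha256_hex("something else")])

print(seal([relabelled]) == original_seal)  # False
print(seal([rewired]) == original_seal)     # False
```

Because the method and the input hashes are part of what is hashed, neither change can pass as the original, even though the content hash alone would still match.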
Standard library at the core. The impurity is fenced at the edge.
The core runs on Python’s standard library with no third-party runtime dependency. An adapter may pull in whatever its source demands, a browser binary, an OCR engine, a transcription CLI, but only behind the Source shape and only at the impure edge. Clocks are injected everywhere a time is recorded, so given the same fetched items a run is deterministic and replayable, and verification stays trustworthy. The four stages below are the spine of a run, each one re-checkable from disk.
Source shape, the only place the network, a tool, or a credential is allowed
corpus verify re-hashes every stored body
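Clock injection, mentioned above, is easy to show in miniature. The `stamp` function and the `Clock` alias here are hypothetical; the point is only that a caller-supplied clock makes two runs over the same content produce identical records.

```python
import hashlib
from datetime import datetime, timezone
from typing import Callable

Clock = Callable[[], datetime]


def system_clock() -> datetime:
    return datetime.now(timezone.utc)


def stamp(content: str, method: str, clock: Clock = system_clock) -> dict:
    """Record a time without reading the system clock directly."""
    return {
        "method": method,
        "time": clock().isoformat(),
        "sha256": hashlib.sha256(content.encode("utf-8")).hexdigest(),
    }


def fixed_clock() -> datetime:
    # A replayable clock for tests and re-runs.
    return datetime(2024, 1, 1, tzinfo=timezone.utc)


first = stamp("same content", "http-get", clock=fixed_clock)
second = stamp("same content", "http-get", clock=fixed_clock)
print(first == second)  # True
print(first["time"])    # 2024-01-01T00:00:00+00:00
```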
A full gather run folds fetch, scope, optional synthesis, digest, and store into one re-checkable RunRecord with its own seal over what it did. Recall queries a stored corpus and re-verifies every body it returns, so a missing or corrupt item is reported, never handed back as if intact. The scope filter, the synthesizer, the store, and the external provenance verdict are all seams that default to a Null, so gather stands alone and a peer organ plugs in without the core ever importing it.
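The Null-default seam is a standard null-object pattern. The sketch below shows it for the scope filter only; `ScopeFilter`, `NullScope`, `KeywordScope`, and `scope_items` are illustrative names and do not claim to match gather's own.

```python
from typing import List, Optional, Protocol


class ScopeFilter(Protocol):
    def keep(self, content: str) -> bool: ...


class NullScope:
    """Default seam: no peer organ plugged in, so every item is kept."""
    def keep(self, content: str) -> bool:
        return True


class KeywordScope:
    """Stand-in for a peer organ that narrows a run to one topic."""
    def __init__(self, keyword: str) -> None:
        self.keyword = keyword.lower()

    def keep(self, content: str) -> bool:
        return self.keyword in content.lower()


def scope_items(contents: List[str],
                scope: Optional[ScopeFilter] = None) -> List[str]:
    # The core depends only on the ScopeFilter shape; None falls back to the Null.
    active = scope if scope is not None else NullScope()
    return [content for content in contents if active.keep(content)]


contents = ["Receipts and hashing", "Cooking notes", "Hashing a corpus"]
print(scope_items(contents))
print(scope_items(contents, KeywordScope("hashing")))
```

With no filter supplied the run keeps everything; a peer can pass in its own object without the core importing the peer's module.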
Current state. How to run it.
Version 1.5.0, on PyPI as gather-engine. It installs the gather command and the gather package. The core is pure standard library; a few adapters call an external tool (yt-dlp, pdftotext, a headless chromium, tesseract, a Whisper-style CLI), which you install only if you use that adapter. gather reached its organic completion at 1.5.0: every planned source and seam is shipped, and the accountability claims hold end to end across a final whole-system review. The license is fair-source: source-available, rights reserved to fund the research; the smaller utility tools that came out of this work stay permissive.
$ pip install gather-engine
$ python examples/demo.py   # one video parsed, scoped, digested, then a receipt catches tampering
parsed 3 items from one video, each with a receipt
witnessed digest: 3 receipts, verified True
after tampering one receipt, digest verifies: False   <- caught
$ python examples/pipeline.py   # the whole organ: run, store, verify, recall, offline
Python, standard library only. Install it, run the demo offline, then point an adapter at a real source. Each example under examples/ is a short demonstration of one capability: the receipt that catches tampering, the corpus that re-hashes every body, the recall that re-verifies what it returns. If an adapter reaches something it should not, or a digest produces a result you cannot verify, that is exactly the kind of report I want.