Zain Dana Harper · gather · accountable research-intake

Reach it anywhere. Record how you got it.

Research lives behind awkward access: captions and comments on a video, a paper behind an arXiv gate, a page that renders only under JavaScript, a scanned PDF, an audio file, an API behind a credential wall, a fact that exists only across scattered fragments and has to be put together. Most tools handle one of those and break on the rest, and when they do reach something, you cannot tell later whether a line was pulled straight from the source or pieced together along the way. gather handles all of it, and writes down how.

Every item carries a receipt recording how it was obtained, so a quote is never confused with an inference, and what was hard to get is never dressed up as if it were lying in the open.

gather · v1.5.0 · provenance receipts · fair-source · pip install gather-engine

It brings information in, from anywhere, and records how.

gather is the research-intake organ of the constellation. It reaches information from scattered, often awkward sources and returns it as items, each carrying a provenance receipt. It is the one place where network access, third-party tools, and credentials are allowed to live, isolated behind source adapters, so the rest of the constellation stays clean, offline, and deterministic. A run scope-filters what it gathered, folds the receipts into one witnessed digest, and hands that digest downstream to index, refine, and the crucible.

Every kind of intake is one small adapter behind a single shape: one string in, a list of receipted items out. The adapter can use the network, a tool, a credential, a headless browser, whatever the source demands; nothing else in gather imports any of that. Awkward access is an adapter problem, not a system problem, and the core stays pure standard library because of it.
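A minimal sketch of that one shape, under stated assumptions: the names Item, Source, and DocsSource, the field layout, and the file-read method label are hypothetical and are not taken from gather's code. It only illustrates "one string in, a list of receipted items out" with a pure standard-library adapter.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path
from typing import Protocol


@dataclass(frozen=True)
class Item:
    """One gathered item with a minimal receipt (hypothetical field names)."""
    source: str
    ref: str
    method: str
    content: str
    sha256: str


class Source(Protocol):
    """The single shape: one string in, a list of receipted items out."""

    def fetch(self, ref: str) -> list[Item]: ...


class DocsSource:
    """Hypothetical local-file adapter; the filesystem is its only impure edge."""

    def fetch(self, ref: str) -> list[Item]:
        path = Path(ref)
        if path.is_dir():
            files = sorted(p for p in path.rglob("*") if p.is_file())
        else:
            files = [path]
        items = []
        for f in files:
            text = f.read_text(encoding="utf-8")
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            # "file-read" is an assumed method label, used only for illustration.
            items.append(Item("docs", str(f), "file-read", text, digest))
        return items
```

Anything that needs a network, a tool, or a credential would sit behind the same fetch signature, so the caller never has to know which kind of adapter it is holding.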

The receipt is the rule. Each item records how it was obtained, so a transcript read from captions, text recognized from a scan, and a fact synthesized from fragments are all valid items, but they are not equal, and they are never confused.

Reach anywhere. Then say exactly how.

gather ships a source adapter for each kind of intake, all behind the one shape. The cheap, local sources are pure standard library; the harder ones shell out to their own external tool, never a Python dependency. Each adapter records the method it used, so the accountability is in place before the harder reach is trusted. The method on every item is the load-bearing field: it keeps a direct read and a derived inference from ever being read as the same thing.

video: captions, metadata, and comments via yt-dlp; pure parsing behind an impure shell, each item receipted
web: the static HTML a server returns, no JavaScript; a client-rendered page yields only its shell, and the receipt says http-get, exactly that
feed · docs: RSS or Atom feeds, and local files or a directory of them; both pure standard library, the filesystem the only impure edge
arxiv: papers from the arXiv API by id or query; a pure parser, the item carrying the abstract and metadata
pdf · ocr: text from a local PDF via pdftotext, and from a scanned image via tesseract; a best-effort machine reading, labelled ocr, never dressed up as clean text
transcribe: a transcript from audio via a Whisper-style CLI; a machine transcription, labelled transcribe so a listener knows it was machine-made
api: an authenticated JSON API; the token is read from the environment, sent as a header, and never written to a receipt, a URL, or the disk
browser: JavaScript-rendered pages via a real headless browser; the receipt records browser-extract, so you know JS ran. The most exposed edge; see the limit below
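The credential rule for the api adapter can be sketched as follows. This is an illustration under assumptions, not gather's implementation: the environment variable GATHER_API_TOKEN, the Bearer scheme, the api-get method label, and both function names are hypothetical. The point it shows is structural: the receipt builder is never handed the token, so it cannot leak it.

```python
import hashlib
import os
import urllib.request


def build_request(url: str, env=os.environ) -> urllib.request.Request:
    """Attach the token as a header only; the URL is left untouched."""
    token = env.get("GATHER_API_TOKEN")  # hypothetical variable name
    if not token:
        raise RuntimeError("GATHER_API_TOKEN is not set")
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def receipt_for(url: str, body: bytes, fetched_at: str) -> dict:
    """Build the receipt from the URL and body alone; the token is never passed in."""
    return {
        "source": "api",
        "ref": url,
        "method": "api-get",  # hypothetical label
        "time": fetched_at,
        "sha256": hashlib.sha256(body).hexdigest(),
    }
```

Passing the environment in as a parameter keeps the sketch testable without touching the real process environment.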

The honesty matters most exactly where the reach is widest. The web adapter runs no JavaScript, so it is honest that a client-rendered page gives back only its shell. The browser adapter runs a real headless browser, and it is the most exposed organ in the system: its host guard covers only the first navigation, and the rendered page then follows its own redirects and sub-requests unguarded. So do not point it at untrusted URLs in an environment where internal services are reachable. The browser-extract method is honest that JavaScript ran, but it cannot make the fetch safe. The threat model says so plainly, in the open, rather than hiding it.

The receipt is the differentiator.

A provenance receipt rides on every item: the source, the reference, the method, the time, and a sha256 of the item’s own content. Re-hash the content and you can confirm it is what was obtained, unaltered. The method (yt-dlp, browser-extract, ocr, transcribe, synthesized, and so on) keeps the harder distinction on the record: a quote pulled straight from a source and an inference pieced together from fragments are both valid items, but they are never equally direct, and the receipt never lets one be read as the other.
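A minimal sketch of such a receipt and the re-hash check. The five fields come from the description above; the class name, the function names, and the choice to pass the time in as a plain string are assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Receipt:
    source: str
    ref: str
    method: str
    time: str
    sha256: str


def make_receipt(source: str, ref: str, method: str, content: str, time: str) -> Receipt:
    """Stamp a receipt; the time is supplied by the caller, not read from a clock here."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    return Receipt(source, ref, method, time, digest)


def verify(receipt: Receipt, content: str) -> bool:
    """Re-hash the content and compare it with the hash on the receipt."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest() == receipt.sha256
```

A usage example: a receipt stamped for the text "Abstract text" with method ocr verifies against that exact text and fails against any altered copy, while the method field keeps saying it was a machine reading.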

A fact pieced together is not a fact pulled straight from the source. The receipt keeps them apart.

A derived item, one assembled or inferred from other items rather than fetched, is the sharp case, and the receipt is built for it. Its sha256 fingerprints the inference itself, not its sources, because it is a new statement that can only witness itself; a derived_from field records the content hash of each input, a re-checkable pointer back to the exact source content. The honesty is mechanical where it can be: the synthesized label is reachable only through one seam, and the bare builder refuses to stamp it and defaults to compiled, so a plain call can never forge a synthesis. With no model wired in, the default compiles the inputs verbatim and invents nothing. The digest seal folds in the method and the inputs alongside the hash, so relabelling an inference as a direct fetch, or quietly rewriting what it was built from, breaks the seal exactly as altering the content does.
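The mechanics described here can be sketched as below. All names (Derived, build_derived, synthesize, seal) and the seal's exact encoding are hypothetical; only the behaviour is taken from the text: the bare builder refuses the synthesized label and defaults to compiled, the one seam compiles verbatim when no model is wired in, and the seal covers the method and the inputs as well as the content hash.

```python
import hashlib
from dataclasses import dataclass


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class Derived:
    content: str
    method: str
    sha256: str                    # fingerprints the derived statement itself
    derived_from: tuple[str, ...]  # content hash of each input


def _stamp(content: str, inputs: list[str], method: str) -> Derived:
    return Derived(content, method, sha256_hex(content),
                   tuple(sha256_hex(i) for i in inputs))


def build_derived(content: str, inputs: list[str], method: str = "compiled") -> Derived:
    """Bare builder: defaults to compiled and refuses to stamp synthesized."""
    if method == "synthesized":
        raise ValueError("synthesized may only be stamped through synthesize()")
    return _stamp(content, inputs, method)


def synthesize(inputs: list[str], model=None) -> Derived:
    """The single seam. With no model, compile the inputs verbatim."""
    if model is None:
        return _stamp("\n".join(inputs), inputs, "compiled")
    return _stamp(model(inputs), inputs, "synthesized")


def seal(d: Derived) -> str:
    """Fold the method and the input hashes in alongside the content hash."""
    return sha256_hex("|".join([d.method, d.sha256, *d.derived_from]))
```

Because the seal covers method and derived_from, relabelling a synthesized item as http-get, or dropping one of its inputs, yields a different seal even though the content hash is unchanged.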

Standard library at the core. The impurity is fenced at the edge.

The core runs on Python’s standard library with no third-party runtime dependency. An adapter may pull in whatever its source demands, a browser binary, an OCR engine, a transcription CLI, but only behind the Source shape and only at the impure edge. Clocks are injected everywhere a time is recorded, so given the same fetched items a run is deterministic and replayable, and verification stays trustworthy. The four stages below are the spine of a run, each one re-checkable from disk.

fetch: each adapter returns receipted items behind the one Source shape, the only place the network, a tool, or a credential is allowed
scope: a deterministic, order-preserving filter keeps what serves the work, drops the rest, and records how many it dropped
digest: the items’ receipts fold into one re-checkable seal, order-independent across items but covering every field of each receipt
store: an optional content-addressed corpus dedups bodies by hash, keeps every distinct receipt, and corpus verify re-hashes every stored body
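The scope and digest stages can be sketched in a few lines. This is one possible construction, not gather's actual encoding: the function names, the dict-shaped receipts, and the choice of sorted per-receipt hashes are assumptions. It shows a filter that preserves order and counts what it dropped, and a seal that ignores item order while covering every receipt field.

```python
import hashlib
import json


def scope(items: list[dict], keep) -> tuple[list[dict], int]:
    """Order-preserving filter; also reports how many items were dropped."""
    kept = [it for it in items if keep(it)]
    return kept, len(items) - len(kept)


def digest(receipts: list[dict]) -> str:
    """One seal over all receipts, independent of the order they arrive in."""
    # Hash each receipt over all of its fields (sorted keys for stability),
    # then sort the per-receipt hashes so item order cannot change the seal.
    leaves = sorted(
        hashlib.sha256(json.dumps(r, sort_keys=True).encode("utf-8")).hexdigest()
        for r in receipts
    )
    return hashlib.sha256("\n".join(leaves).encode("utf-8")).hexdigest()
```

With this construction, swapping two receipts leaves the digest unchanged, while editing any single field of any receipt, the method included, changes it.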

A full gather run folds fetch, scope, optional synthesis, digest, and store into one re-checkable RunRecord with its own seal over what it did. Recall queries a stored corpus and re-verifies every body it returns, so a missing or corrupt item is reported, never handed back as if intact. The scope filter, the synthesizer, the store, and the external provenance verdict are all seams that default to a Null, so gather stands alone and a peer organ plugs in without the core ever importing it.
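A small in-memory sketch of the store and recall behaviour described above. The Corpus class, its method names, and the status strings are hypothetical, and a real corpus would live on disk; the sketch only shows bodies deduplicated by hash, every distinct receipt kept, and recall re-verifying each body before returning it.

```python
import hashlib


def _h(body: str) -> str:
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


class Corpus:
    def __init__(self):
        self.bodies = {}    # sha256 -> body, each distinct body stored once
        self.receipts = []  # every distinct receipt is kept

    def add(self, body: str, receipt: dict) -> None:
        key = _h(body)
        if receipt["sha256"] != key:
            raise ValueError("receipt hash does not match body")
        self.bodies.setdefault(key, body)
        if receipt not in self.receipts:
            self.receipts.append(receipt)

    def verify(self) -> list:
        """Re-hash every stored body; return the keys that no longer match."""
        return [k for k, b in self.bodies.items() if _h(b) != k]

    def recall(self, ref_part: str) -> list:
        """Look up by reference; re-verify each body before handing it back."""
        results = []
        for r in self.receipts:
            if ref_part not in r["ref"]:
                continue
            body = self.bodies.get(r["sha256"])
            if body is None:
                status, body = "missing", None
            elif _h(body) != r["sha256"]:
                status, body = "corrupt", None  # never returned as if intact
            else:
                status = "ok"
            results.append({"receipt": r, "status": status, "body": body})
        return results
```

Two receipts that point at the same content share one stored body, and a body altered after storage is reported as corrupt by both verify and recall.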

Current state. How to run it.

Version 1.5.0, on PyPI as gather-engine. It installs the gather command and the gather package. The core is pure standard library; a few adapters call an external tool (yt-dlp, pdftotext, a headless chromium, tesseract, a Whisper-style CLI), which you install only if you use that adapter. gather reached its organic completion at 1.5.0: every planned source and seam is shipped, and the accountability claims hold end to end across a final whole-system review. The license is fair-source: source-available, rights reserved to fund the research; the smaller utility tools that came out of this work stay permissive.

$ pip install gather-engine
$ python examples/demo.py       # one video parsed, scoped, digested, then a receipt catches tampering
parsed 3 items from one video, each with a receipt
witnessed digest: 3 receipts, verified True
after tampering one receipt, digest verifies: False   <- caught
$ python examples/pipeline.py   # the whole organ: run, store, verify, recall, offline

Python, standard library only. Install it, run the demo offline, then point an adapter at a real source. Each example under examples/ is a short demonstration of one capability: the receipt that catches tampering, the corpus that re-hashes every body, the recall that re-verifies what it returns. If an adapter reaches something it should not, or a digest produces a result you cannot verify, that is exactly the kind of report I want.

GitHub: HarperZ9/gather  ·  PyPI: gather-engine