Ingestion spec — compiler agent contract

The operational manual for the LLM agent that compiles raw book markdown into wiki pages. Follow this when running an ingest, query, or lint pass.

Read before running: DESIGN.md and wiki/schema.md.


Operation 1 — INGEST

One book at a time. Never batch-ingest.

Steps

  1. Select a book not yet ingested. Read log.md to confirm. The queue report from python scripts/lint_wiki.py (check 9) ranks pending books by inbound forward-ref count; normally ingest the top entry.
  2. Read the raw source raw/TR-XX-<slug>.md end to end (the only place in the workflow where you use the TR-XX plumbing ID — to locate the file). If the book is long, read in passes.
  3. Report structure to the human: chapter list, main claims, principal entities (concepts, symbols, figures, traditions) introduced or developed, any cross-references to already-ingested or pending books (refer to them by Arabic title, not TR-XX). Wait for the human to confirm or correct.
  4. Create the book page at wiki/books/<slug>.md: a pure slug filename, no TR-XX- prefix. Use the slug from manifest.tsv column 2.
    • Frontmatter per schema.md § book: title, title_fr, type: book, aliases, author, translator, source_file: raw/TR-XX-<slug>.md (the only line on the page where TR-XX may appear), chapters, updated. Do not set books: on a book page.
    • Body: brief bio-bibliographic note, book’s argument in one paragraph, chapter-by-chapter summary (2-4 sentences each), thematic map, list of entities introduced, ## نصوص الشيخ عبد الباقي مفتاح for verbatim Meftah excerpts when present, ## قراءة الموسوعة لتعليقات الشيخ عبد الباقي مفتاح for editor synthesis of Meftah’s distinctive contributions (kept segregated from Guénon’s claims), ## شواهد من الكتب with 2-4 verbatim quotations.
  5. For each entity that the book introduces or substantially develops:
    • If the page doesn’t exist → create it at wiki/<dir>/<slug>.md per schema.
    • If it exists → update body: add this book’s treatment, then append a wiki-link string to the books: frontmatter list, e.g. books: ["[[books/<existing-slug>|<existing title>]]", "[[books/<new-slug>|<new title>]]"]. Never add a bare TR-XX token to books:. Add new aliases/synonyms to aliases:. Add a section ### عند غينون في [[books/<new-slug>|<Arabic title>]] quoting or summarising this book’s angle.
  6. Wire backlinks. Every [[link]] must resolve. Every book’s entity-list must match what the entity pages claim.
  7. Update index.md. Add lines for newly created pages under the right category heading.
  8. Run the linter: python scripts/lint_wiki.py. All 8 checks must return OK before declaring the ingest done. The only expected non-zero line is “intentional forward-refs to future books”, which is the running count of book pages not yet ingested. Even when there are no forward-refs, the queue report may still list manifest books that are not yet ingested; they simply have no inbound links ranking them. If any check returns FAIL, fix it before logging.
  9. Run the quote provenance checker: python scripts/check_quote_provenance.py. Treat failures as manual-review blockers: either fix the quote against raw/, move non-source prose out of ## نصوص الشيخ عبد الباقي مفتاح, or record why the quote is intentionally outside the raw corpus.
  10. Append log.md entry (format below).
  11. Report touched pages to the human. A typical book ingest touches 15-60 pages.
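
The books: update in step 5 can be sketched as follows. This is a minimal illustration, not the project's tooling: the helper names book_link and add_book are hypothetical, the frontmatter is assumed to be already parsed into a plain Python list, and the TR-XX pattern assumes IDs of the form TR- followed by digits (as in TR-01).

```python
import re

# Hypothetical helpers for illustration; not part of scripts/lint_wiki.py.
# Assumption: plumbing IDs look like "TR-" followed by digits.
TR_TOKEN = re.compile(r"\bTR-\d+\b")


def book_link(slug: str, title: str) -> str:
    """Build the wiki-link string stored in the books: list."""
    if TR_TOKEN.search(slug) or TR_TOKEN.search(title):
        raise ValueError("TR-XX must not appear in a books: entry")
    return f"[[books/{slug}|{title}]]"


def add_book(books: list[str], slug: str, title: str) -> list[str]:
    """Return a new list with the link appended, skipping duplicates."""
    link = book_link(slug, title)
    return books if link in books else books + [link]
```

Returning a new list rather than mutating in place keeps the existing entries untouched, which makes it easy to diff the frontmatter before writing it back.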

Quality bar per book ingest

  • 0 broken [[links]].
  • 0 invented citations (every quotation verbatim from raw/).
  • 0 blended Meftah/editor voice in source-text sections. Use ## نصوص الشيخ عبد الباقي مفتاح only for verbatim material; use ## قراءة الموسوعة لتعليقات الشيخ عبد الباقي مفتاح for synthesis.
  • Arabic register consistent with Meftah’s prose.
  • No entity left with only type: set but no body.

Do NOT during ingest

  • Paraphrase quotations into pseudo-citations.
  • Put editor synthesis under a heading that implies it is Meftah’s own wording.
  • Introduce new type: categories without first updating schema.md.
  • Edit raw/*.md to “fix” OCR or translation quirks — note them in the entity page instead.
  • Run consecutive ingests without the human at least briefly browsing the result in between.
  • Write TR-XX anywhere a reader sees it. Book pages use pure slug filenames (haymanat-al-kamm-wa-alamat-akhir-al-zaman.md, no prefix). Inline references use the book’s Arabic short title wiki-linked: [[books/<slug>|هيمنة الكمّ]]. Citations say (هيمنة الكمّ، الفصل X), never (TR-01, ...). The books: property stores wiki-link strings, not TR-XX labels. TR-XX appears only in raw/ and manifest.tsv, plus the source_file: line of a book page’s frontmatter — nowhere else. See schema.md § “The TR-XX zone rule”.
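
As a rough illustration of the zone rule above, the sketch below scans one page for TR-XX tokens and exempts only the source_file: line inside the frontmatter. It is not the linter's actual implementation: find_tr_leaks is a hypothetical name, the frontmatter is assumed to be delimited by --- lines at the top of the page, and the sketch does not check that the page is really a book page before granting the exemption.

```python
import re

# Assumption: plumbing IDs look like "TR-" followed by digits.
TR_TOKEN = re.compile(r"\bTR-\d+\b")


def find_tr_leaks(page_text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs where a TR-XX token leaks."""
    lines = page_text.splitlines()

    # Locate the closing --- of the frontmatter, if the page starts with one.
    fm_end = 0
    if lines and lines[0].strip() == "---":
        for i in range(1, len(lines)):
            if lines[i].strip() == "---":
                fm_end = i
                break

    leaks = []
    for idx, line in enumerate(lines):
        in_frontmatter = 0 < idx < fm_end
        # The source_file: line is the one place TR-XX is allowed.
        if in_frontmatter and line.lstrip().startswith("source_file:"):
            continue
        if TR_TOKEN.search(line):
            leaks.append((idx + 1, line))
    return leaks
```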

Operation 2 — QUERY

Steps

  1. Read index.md. Identify candidate pages from category and title.
  2. Open the candidate pages. For deep questions, follow [[links]] 1-2 hops.
  3. Answer the human in chat, with every claim cited to a wiki page (which in turn cites raw/TR-XX).
  4. File the answer if it’s worth preserving: create wiki/queries/YYYY-MM-DD-<slug>.md, link from relevant entity pages’ ## ارتباطات section.
  5. Append log.md entry.

Filing criterion

File if the question needed multi-hop reasoning, surfaced a new cross-reference, or is likely to be asked again. Don’t file trivial lookups.


Operation 3 — LINT

Run after every 3-5 ingests, or on demand.

Automated checks — run first

python scripts/lint_wiki.py performs eight mechanical checks:

  1. TR-XX leakage in reader-facing zones.
  2. Broken [[wiki-links]] (excluding intentional forward-refs to future books).
  3. Self-links (page linking to itself).
  4. Frontmatter sanity (title, type, updated present on every page).
  5. books: property format (no bare TR-XX).
  6. Orphan pages (entity pages with no inbound links).
  7. Index drift (index.md ↔ filesystem).
  8. Backlink symmetry (entity cites book ⇒ book page links entity).

All eight must return OK. Fix any FAIL; investigate every WARN.
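
To make check 2 concrete, here is a small sketch of how broken wiki-links could be detected while skipping intentional forward-refs. The function name broken_links and both set parameters are assumptions for illustration; the real logic lives in scripts/lint_wiki.py and may differ.

```python
import re
from collections.abc import Collection

# Matches [[target]], [[target|label]] and [[target#heading|label]].
WIKI_LINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def broken_links(
    page_text: str,
    existing_pages: Collection[str],
    forward_refs: Collection[str] = frozenset(),
) -> list[str]:
    """Return link targets that resolve to no page and are not forward-refs."""
    targets = {m.group(1).strip() for m in WIKI_LINK.finditer(page_text)}
    return sorted(
        t for t in targets
        if t not in existing_pages and t not in forward_refs
    )
```

Here existing_pages would hold page paths without the .md suffix (for example books/<slug>), and forward_refs the book pages that are linked but not yet ingested.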

Then run python scripts/check_quote_provenance.py. This is a mechanical provenance assistant for block quotes and Meftah source sections. It is stricter than the wiki linter and may require human judgment, but new unmatched quotes should be fixed or explicitly explained before logging completion.

Manual checks — after the automated pass

  • Citation integrity: run python scripts/check_quote_provenance.py, then sample 10 citations manually and verify verbatim against raw/.
  • Contradictions: scan for pages claiming opposing things about the same entity. Flag in a ## ملاحظات الفحص section on the affected page; don’t auto-resolve.
  • Alias coverage: for common concepts, all Arabic synonyms are listed in aliases:.
  • Stale summaries: each book page’s entity list matches the entities that actually link back (partially covered by check 8, but review the prose too).
  • Tashkīl/tatweel normalisation: verbatim quotes keep source styling; titles and aliases should carry plain forms as alternates so Obsidian search works.
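
For the normalisation item above, a plain form can be derived by dropping combining marks and the tatweel character. The sketch below is one possible approach, assuming that "plain" means no tashkīl (Unicode category Mn) and no tatweel (U+0640); plain_form is a hypothetical helper name. It deliberately avoids Unicode decomposition so that precomposed letters such as أ and آ keep their hamza and madda.

```python
import unicodedata

TATWEEL = "\u0640"  # Arabic elongation character, category Lm


def plain_form(text: str) -> str:
    """Strip tashkil (nonspacing marks) and tatweel from a title or alias."""
    return "".join(
        ch for ch in text
        if ch != TATWEEL and unicodedata.category(ch) != "Mn"
    )
```

The result is meant for aliases only; verbatim quotes keep their source styling.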

Record the lint result in log.md with counts per issue type. Fix or flag each finding.


Log format

 
## [YYYY-MM-DD] <ingest|query|lint> — <short subject>
 
- operation: <ingest|query|lint>
- book: TR-XX           (ingest only)
- pages_created: <n>
- pages_updated: <n>
- notable: <one-line takeaway>

Append at the bottom of log.md. Never rewrite history.
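
A minimal sketch of producing and appending an entry in this format follows. The function names and parameters are hypothetical, and the book line is emitted only for ingest entries, as the format indicates.

```python
from datetime import date
from pathlib import Path

EM_DASH = "\u2014"  # separator used in the log heading format
OPERATIONS = {"ingest", "query", "lint"}


def format_log_entry(operation, subject, pages_created, pages_updated,
                     notable, book=None, day=None) -> str:
    """Render one log.md entry as text."""
    if operation not in OPERATIONS:
        raise ValueError(f"unknown operation: {operation}")
    day = day or date.today()
    lines = [
        f"## [{day.isoformat()}] {operation} {EM_DASH} {subject}",
        "",
        f"- operation: {operation}",
    ]
    if operation == "ingest" and book is not None:
        lines.append(f"- book: {book}")
    lines += [
        f"- pages_created: {pages_created}",
        f"- pages_updated: {pages_updated}",
        f"- notable: {notable}",
    ]
    return "\n".join(lines) + "\n"


def append_log_entry(log_path: Path, entry: str) -> None:
    """Append at the bottom; existing history is never rewritten."""
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write("\n" + entry)
```

Opening the file in append mode means earlier entries cannot be altered by this helper, which matches the rule that history is never rewritten.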


What to discuss with the human

Always raise:

  • Structural choices a book forces on the taxonomy (new type: needed? existing type needs splitting?).
  • Contradictions between this book and an earlier ingested book.
  • Translations of Guénon’s French terms where Meftah’s choice is ambiguous.

Don’t ask permission for:

  • Routine page creation and linking.
  • Fixing obvious broken links.
  • Adding an alias to a page.