Karatage Brain — System Design Proposal

The Problem

Karatage's deal knowledge is trapped in a large OneDrive folder structure. Thousands of documents — term sheets, financial statements, board resolutions, DD reports, NDAs — sit in company-grouped folders. Getting an answer means digging through folders manually.

What’s Missing

✗ No single source of truth across deals
✗ No way to ask questions without manual search
✗ No structured view of deal pipelines
✗ No audit trail of how information was derived

What We Want

✓ System watches OneDrive automatically
✓ Reads & understands every document
✓ Builds a queryable knowledge base
✓ Answers questions via WhatsApp

Solution Overview

Karatage Brain is purpose-built for PE/VC deal operations. It watches your existing OneDrive workflow, reads every document, builds structured state, and lets the team query it naturally via WhatsApp or a web console.

flowchart LR
    OD["OneDrive\n(Source of Truth)"]
    Mirror["Local Mirror"]
    Pipeline["Pipeline\n6 Stages"]
    DB["PostgreSQL\n+ pgvector\n(The Brain)"]
    WA["WhatsApp"]
    Web["Web Console"]
    CLI["CLI"]

    OD -->|"Microsoft Graph\nDelta Sync"| Mirror
    Mirror -->|"6-stage pipeline"| Pipeline
    Pipeline --> DB
    DB --> WA
    DB --> Web
    DB --> CLI

    style OD fill:#e0effe,stroke:#0073c5,color:#064e84
    style DB fill:#f0fdf4,stroke:#16a34a,color:#166534
    style Pipeline fill:#fef9c3,stroke:#ca8a04,color:#854d0e

Core Principle

Watch-and-surface, operator-decides. The system reads documents and proposes structured state. It never silently fabricates data. When it’s uncertain, it surfaces a proposal for a human to resolve.

Architecture

graph TB
    subgraph office["Office (Private Network)"]
        subgraph app["Application Server"]
            subgraph docker["Docker Compose"]
                pg["PostgreSQL 16\n+ pgvector\n+ pg_trgm"]
                parser["Parser\nSidecar"]
                api["API\n(FastAPI)"]
                worker["Worker\n(Procrastinate)"]
                web["Web UI\n(Next.js)"]
                waha["WAHA\n(WhatsApp)"]
            end
            mirror["OneDrive Mirror"]
            cli["brain CLI"]
        end
        ai["AI Server\n(vLLM / Ollama)\nLocal GPU inference"]
        worker --> ai
        api --> ai
    end

    onedrive["OneDrive (M365)"]
    entra["Entra IDP"]
    cf["Cloudflare Access"]
    users["Team"]

    onedrive -->|"Graph API\nDelta Queries"| mirror
    mirror --> worker
    cli --> pg
    api --> pg
    worker --> pg
    worker --> parser
    api --> parser
    web --> api
    waha --> api

    users --> cf
    cf --> entra
    cf --> web

    style pg fill:#d1fae5,stroke:#059669
    style ai fill:#fef9c3,stroke:#ca8a04
    style docker fill:#f0f9ff,stroke:#0284c7
    style app fill:#fafafa,stroke:#d4d4d4
    style office fill:#f0fdf4,stroke:#16a34a

Service Responsibilities

Service	Responsibility
PostgreSQL	All state. Typed schema + pgvector embeddings + full-text search. Single database, no external deps.
Parser Sidecar	Stateless. Converts bytes → text. PDF, Office, images. Vision LLM for OCR on scanned docs.
API (FastAPI)	HTTP API. Serves web UI, WhatsApp webhook, RAG queries. Enqueues async jobs.
Worker	Procrastinate async jobs. Pipeline stages, OneDrive sync, embedding backfills.
Web (Next.js)	Operator console. Dashboard, entity browsers, pipeline queues, ask-the-brain.
WAHA	Self-hosted WhatsApp gateway. One dedicated number. Bridges WhatsApp ↔ HTTP.
AI Server	Dedicated GPU hardware on the private network. Serves all LLM inference (classify, extract, embed, OCR, RAG). OpenAI-compatible API.

The Ingestion Pipeline

Six stages transform unstructured documents into typed, queryable state, with an optional adjudication step when classification is a close call. Each stage is independently cached and re-runnable.

flowchart TD
    mirror["OneDrive Mirror"] --> discover

    discover["DISCOVER\nSHA-256 hash = identity\nNew hash? Create row.\nKnown hash? Skip."]
    parse["PARSE\nParser sidecar: PDF, DOCX,\nXLSX, images to text chunks\nwith page locators"]
    classify["CLASSIFY\nLLM scores document against\nall known types.\nThreshold: min 0.80, margin min 0.20"]
    adjudicate["ADJUDICATE\nPair-specific LLM call\nto disambiguate"]
    extract["EXTRACT\nPer-type structured extraction\nLLM to Pydantic model to JSON\nwith per-field confidence"]
    resolve["RESOLVE\nMap names/IDs to canonical\ndatabase entities.\nAmbiguous = proposal"]
    apply["APPLY\nStratified fixpoint loop.\nPreconditions + effects.\nLoop until convergence."]
    unknown["UNKNOWN\nOperator triage queue"]

    discover --> parse --> classify
    classify -->|"Confident"| extract
    classify -->|"Below threshold"| unknown
    classify -->|"Margin too narrow"| adjudicate
    adjudicate -->|"Resolved"| extract
    adjudicate -->|"Still unclear"| unknown
    extract --> resolve --> apply

    style discover fill:#e0effe,stroke:#0073c5
    style parse fill:#e0effe,stroke:#0073c5
    style classify fill:#fef9c3,stroke:#ca8a04
    style adjudicate fill:#fef3c7,stroke:#d97706
    style extract fill:#d1fae5,stroke:#059669
    style resolve fill:#d1fae5,stroke:#059669
    style apply fill:#d1fae5,stroke:#059669
    style unknown fill:#fee2e2,stroke:#dc2626

Stage Caching (Make-Style Invalidation)

Each stage writes its output to cache columns on the document's row in the documents table. A stage re-runs for a document only when its version key differs from the cached one.

Stage	Cached Columns	Version Key	Invalidation Trigger
Parse	`parsed_at`, `parser_version`, `parse_output`	PARSER_VERSION constant	Bump the constant
Classify	`classified_at`, `classifier_version`, `document_type`	SHA-256 of all classifier profiles	Edit any profile
Extract	`extracted_at`, `extractor_version`, `extraction_result`	SHA-256 of per-type profile	Edit the type’s profile
Apply	`application_status`	(none — runs every eligible doc)	Every fixpoint pass
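
The invalidation rule in the table above can be sketched as follows. This is a minimal illustration, not the actual implementation: `profiles_version`, `needs_rerun`, and the value of `PARSER_VERSION` are hypothetical names and values chosen for the example.

```python
import hashlib
from typing import Optional

PARSER_VERSION = "1"  # hypothetical value; bumping it forces a re-parse


def profiles_version(profiles: dict[str, str]) -> str:
    """SHA-256 over all profile names and texts, in a stable (sorted) order."""
    h = hashlib.sha256()
    for name in sorted(profiles):
        h.update(name.encode("utf-8"))
        h.update(b"\0")  # separator so name/text boundaries stay unambiguous
        h.update(profiles[name].encode("utf-8"))
        h.update(b"\0")
    return h.hexdigest()


def needs_rerun(cached_version: Optional[str], current_version: str) -> bool:
    """A stage re-runs when it has never run or its version key changed."""
    return cached_version != current_version
```

Sorting the profile names keeps the version key independent of file listing order, so only a real edit to a profile changes the hash.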

Document Identity: SHA-256 Content Hash

Every document is identified by the SHA-256 hash of its file bytes — not its filename or path. A renamed file is the same document. A file copied to a different folder is the same document. Weekly full syncs skip all unchanged files. Deduplication is automatic.
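
A minimal sketch of content-hash identity, assuming files are read from the local mirror. The function names `content_hash` and `discover` and the in-memory `known_hashes` set are illustrative assumptions; the real system would check hashes against the database.

```python
import hashlib
from pathlib import Path


def content_hash(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of the file bytes, read in chunks so large files stay cheap."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()


def discover(paths, known_hashes: set[str]) -> list[tuple[str, Path]]:
    """Return (hash, path) for files whose content has not been seen before."""
    new = []
    for p in paths:
        digest = content_hash(p)
        if digest not in known_hashes:
            known_hashes.add(digest)  # later copies or renames are skipped
            new.append((digest, p))
    return new
```

Because the filename never enters the hash, a renamed or copied file produces the same digest and is skipped.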

OneDrive Integration

sequenceDiagram
    participant Op as Operator
    participant CLI as brain CLI
    participant Graph as Microsoft Graph API
    participant Mirror as Local Mirror
    participant DB as PostgreSQL
    participant Pipeline as Pipeline

    Op->>CLI: brain sync
    CLI->>DB: Load delta token
    DB-->>CLI: token (or null for first sync)

    alt First sync (no token)
        CLI->>Graph: GET /delta (full tree)
        Graph-->>CLI: All files + deltaLink
    else Incremental sync
        CLI->>Graph: GET {deltaLink}
        Graph-->>CLI: Changed files + new deltaLink
    end

    loop Each changed file
        CLI->>Graph: Download file
        Graph-->>CLI: File bytes
        CLI->>Mirror: Write to mirror/
    end

    CLI->>DB: Store new delta token
    CLI->>Pipeline: Trigger ingestion on changed files
    Pipeline->>DB: discover, parse, classify, ...

Why Manual Trigger?

The OneDrive corpus changes slowly — a few documents per week, not per minute. Continuous polling would be complexity without payoff.

✓ Operator runs brain sync from CLI or WhatsApp
✓ Delta queries ensure only changed files download
✓ Content hash catches anything delta missed
✓ Can upgrade to webhook-driven sync later

Resilience

🛡️ Crash recovery: Delta token persists. Next run picks up where it left off.
🛡️ Download failure: File skipped, retried next cycle.
🛡️ Token expiry (~90 days): Falls back to full sync. Content hash prevents wasted reprocessing.
🛡️ OneDrive is source of truth: Local mirror is read-only derivative.
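
The paging and fallback behaviour described above can be sketched without any network code. The `@odata.nextLink` and `@odata.deltaLink` fields are the ones Microsoft Graph returns for delta queries; `fetch_page`, `ResyncRequired`, and `sync` are hypothetical names for this example, and the sketch assumes the fetcher raises `ResyncRequired` when Graph rejects a stale delta link.

```python
from typing import Callable, Optional


class ResyncRequired(Exception):
    """Raised by the fetcher when the saved delta link is no longer valid."""


def collect_delta(fetch_page: Callable[[str], dict], start_url: str):
    """Follow nextLink pages until a deltaLink appears; return (items, deltaLink)."""
    items, url = [], start_url
    while True:
        page = fetch_page(url)
        items.extend(page.get("value", []))
        if "@odata.nextLink" in page:
            url = page["@odata.nextLink"]
        elif "@odata.deltaLink" in page:
            return items, page["@odata.deltaLink"]
        else:
            raise ValueError("delta response had neither nextLink nor deltaLink")


def sync(fetch_page, saved_link: Optional[str], full_url: str):
    """Incremental sync when a link is saved; full sync otherwise or on resync."""
    try:
        return collect_delta(fetch_page, saved_link or full_url)
    except ResyncRequired:
        # Stale token: fall back to a full enumeration. Content hashes
        # downstream keep unchanged files from being reprocessed.
        return collect_delta(fetch_page, full_url)
```

The caller would store the returned delta link only after the changed files have been written to the mirror, so a crash mid-run repeats the same delta on the next run.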

Entity Model

The typed schema captures the PE/VC deal domain. Core entities are linked by relationship tables that model the real-world connections.

erDiagram
    FUND ||--o{ INVESTMENT : "invests via"
    DEAL ||--o{ INVESTMENT : "facilitates"
    COMPANY ||--o{ INVESTMENT : "receives"

    FUND {
        uuid id PK
        string name
        int vintage_year
        numeric fund_size
        string status
    }

    DEAL ||--o{ DEAL_PARTY : "involves"
    DEAL ||--o{ DEAL_MILESTONE : "tracks"
    DEAL ||--o{ CONTRACT : "has"
    DEAL ||--o{ DD_ITEM : "requires"
    DEAL }o--|| COMPANY : "targets"

    DEAL {
        uuid id PK
        string name
        string deal_type
        string status
        numeric deal_value
        date signed_on
    }

    COMPANY ||--o{ FINANCIAL_DATA : "reports"
    COMPANY ||--o{ DEAL_PARTY : "participates"
    CONTACT ||--o{ DEAL_PARTY : "participates"
    CONTACT }o--o| COMPANY : "works at"

    COMPANY {
        uuid id PK
        string name
        string registration_number
        string country
        string industry
    }

    CONTACT {
        uuid id PK
        string full_name
        string email
        string phone
        string role_type
    }

Canonical Identity Challenge

Unlike regulatory compliance (where entities have government-issued IDs), the PE/VC domain has weaker identifiers:

Entity	Strong ID	Fallback	Resolution Strategy
Company	Registration number	Name + country	Operator confirms via proposal
Contact	Email address	Name + company affiliation	Operator confirms via proposal
Deal	—	Operator-assigned slug	Folder name as strong hint
Fund	—	Operator-assigned name	Exact match only

The system never guesses identity. When it can’t match with confidence, it creates a proposal for the operator to resolve. This prevents the worst outcome: silently linking the wrong entities.
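
A minimal sketch of the never-guess rule for companies, following the table above: a unique registration-number match links directly, and anything weaker becomes a proposal. The `Company` and `Resolution` types and the `resolve_company` function are hypothetical; the real schema is to be designed in Phase 0.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Company:
    id: str
    name: str
    country: str
    registration_number: Optional[str] = None


@dataclass(frozen=True)
class Resolution:
    status: str                       # "matched" or "proposal"
    entity_id: Optional[str] = None
    candidates: tuple[str, ...] = ()  # hints shown to the operator


def resolve_company(extracted: dict, known: list[Company]) -> Resolution:
    # Strong ID: a unique registration-number match is the only automatic link.
    reg = extracted.get("registration_number")
    if reg:
        hits = [c for c in known if c.registration_number == reg]
        if len(hits) == 1:
            return Resolution("matched", hits[0].id)
    # Fallback: name + country only produces candidates for a proposal.
    name = (extracted.get("name") or "").strip().casefold()
    country = extracted.get("country")
    fallback = [c.id for c in known
                if c.name.casefold() == name and c.country == country]
    return Resolution("proposal", candidates=tuple(fallback))
```

Even a single name-and-country hit stays a proposal here, which matches the "operator confirms" strategy in the table.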

Discovery-First Schema Design

This entity model is hypothesised. Phase 0 includes a deliberate discovery step: explore the actual OneDrive corpus, classify sample documents, identify what entities and document types exist, and design the schema from evidence — not speculation.

Classification & Extraction

Classification

The classifier determines what kind of document each file is. It uses markdown profiles — one per document type — assembled into a single LLM prompt.

# term_sheet.md

What it is: Binding or non-binding offer

Signals: “term sheet”, “indicative offer”, purchase price, conditions precedent

Distinguish from: LOI, SPA, MOU

Decision Gate

Outcome	Condition
Classified	Score ≥ 0.80 AND gap to 2nd ≥ 0.20
Adjudicator	Margin too narrow → pair-specific disambiguation
Unknown	Below threshold → operator triage queue
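
The decision gate can be expressed as a small pure function. This is a sketch under one assumption the text does not settle: the threshold check runs before the margin check, so a low-scoring document goes to the unknown queue even if its top two scores are close. The function name `gate` is hypothetical.

```python
def gate(scores: dict[str, float], min_score: float = 0.80, min_margin: float = 0.20):
    """Route a classifier result to 'classified', 'adjudicate', or 'unknown'."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked or ranked[0][1] < min_score:
        return ("unknown", None)
    top_type, top = ranked[0]
    if len(ranked) > 1:
        second_type, second = ranked[1]
        # Round so a gap of exactly 0.20 is not lost to float error.
        if round(top - second, 6) < min_margin:
            return ("adjudicate", (top_type, second_type))
    return ("classified", top_type)
```

With scores of 0.82 and 0.75 the gap is 0.07, so the pair is handed to the adjudicator rather than being classified outright.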

The Adjudicator: How It Resolves Close Calls

When the classifier can’t decide between two types (both score high, but the gap is too narrow), a specialised adjudicator fires for that specific pair.

flowchart LR
    C["Classifier Output\nbank_confirmation: 0.82\nproof_of_address: 0.75\nGap: 0.07 < 0.20"]
    A["Adjudicator\nbank_vs_proof_of_address\n\n'Does this confirm banking\ndetails or merely an address?'"]
    R["Resolved:\nbank_confirmation (0.91)"]

    C -->|"Margin fail"| A -->|"Sharp question\nsharp answer"| R

    style A fill:#fef3c7,stroke:#d97706
    style R fill:#d1fae5,stroke:#059669

Why not just make the classifier better? The classifier sees ~30 types simultaneously. Adjudicators see exactly two. Sharper question → sharper answer. Cheaper and more reliable than one mega-prompt handling every pairwise confusion.

Four Document-Type Shapes

Not every document needs full extraction. Four levels of processing:

Shape	Processing	Example
ignore	Classify only	Marketing brochure, duplicate cover letter. Not worth processing.
link-only	Classify + extract IDs + link	NDA (link to company), CV (link to contact). Shows on entity profiles. 30 min to add.
capture-only	Full extract, no mutations	DD report (findings stored as JSON, not typed rows). Visible + searchable, no schema change.
fully typed	Extract + resolve + apply	Term sheet → creates Deal + Contract + Milestones. Full pipeline. ~1 day to add.

The Stratified Fixpoint

The Problem: Out-of-Order Documents

Documents arrive in unpredictable order. A board resolution approving a deal might be ingested before the term sheet that creates the deal record. Naïve approaches (ordered processing, retry queues) add complexity and don’t converge.

The Solution

Borrowed from Datalog evaluation. Each applier declares preconditions and effects. The apply stage runs in a loop until convergence.

flowchart TD
    subgraph R1["Round 1"]
        ts1["Term Sheet\nPrecondition: none\nCreates Deal 'Acme'\nCreates Company 'Acme Corp'"]
        br1["Board Resolution\nPrecondition: Deal exists\nDeal doesn't exist yet\nStatus: PENDING"]
    end

    subgraph R2["Round 2"]
        br2["Board Resolution\nPrecondition: Deal exists\nDeal now exists!\nCreates Milestone 'board_approval'"]
    end

    subgraph R3["Round 3"]
        conv["No changes = CONVERGED"]
    end

    R1 --> R2 --> R3

    style ts1 fill:#d1fae5,stroke:#059669
    style br1 fill:#fef3c7,stroke:#d97706
    style br2 fill:#d1fae5,stroke:#059669
    style conv fill:#e0effe,stroke:#0073c5

Self-Healing

New documents that satisfy blocked preconditions automatically unlock blocked docs in the next pass. No manual intervention.

Convergence

Max N passes (default 10). Each pass only adds state (monotonic). No changes = converged. Still pending after convergence = blocked.

Status Machine

extracted → pending → applied | blocked | rejected
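
A minimal sketch of the apply loop, assuming state is a set of facts and each applier exposes a precondition and an effects function. It shows a plain fixpoint with a pass cap; stratification is not modelled, and `Doc` and `run_fixpoint` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Doc:
    name: str
    precondition: Callable[[set], bool]  # may this doc be applied yet?
    effects: Callable[[set], set]        # facts this doc adds to the state
    status: str = "pending"


def run_fixpoint(docs: list, state: set, max_passes: int = 10) -> int:
    """Apply docs until a pass changes nothing; return the number of passes."""
    # Self-healing: docs blocked in an earlier run get another chance.
    for doc in docs:
        if doc.status == "blocked":
            doc.status = "pending"
    passes = 0
    while passes < max_passes:
        passes += 1
        changed = False
        for doc in docs:
            if doc.status == "pending" and doc.precondition(state):
                state |= doc.effects(state)  # monotonic: only adds facts
                doc.status = "applied"
                changed = True
        if not changed:
            break  # converged
    for doc in docs:
        if doc.status == "pending":
            doc.status = "blocked"
    return passes
```

When the board resolution is listed before the term sheet, this loop reproduces the three rounds in the diagram: the term sheet applies in the first pass, the board resolution in the second, and the third pass detects convergence.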

RAG — Answering Questions

sequenceDiagram
    participant U as User (WhatsApp/Web)
    participant P as Phase 1: PLANNER
    participant D as Phase 2: DISPATCH
    participant S as Phase 3: SYNTHESIZER

    U->>P: "What's the status of the Acme deal?"

    Note over P: LLM call #1 (JSON mode)
    P->>P: Maps to tool: deal_status
    P->>P: Params: {kind: "deal", name: "Acme"}

    P->>D: {intent, tool, params}

    Note over D: Deterministic code (no LLM)
    D->>D: Resolve "Acme" → Deal ID
    D->>D: Run deal_status tool → structured data
    D->>D: Retrieve top-k chunks (vector + lexical)

    D->>S: tool_results + retrieved_chunks

    Note over S: LLM call #2 (plain text)
    S->>S: Write grounded answer
    S->>S: Cite only provided documents

    S->>U: "The Acme acquisition is in due diligence.\nTerm sheet signed 2026-03-15 for R50M.\n[doc:a1b2c3]"

Why Three Phases?

Approach	Problem
Single LLM call	Can’t do structured lookups. Hallucinates data. No verifiable citations.
LLM + function calling	Model picks wrong tools, retries burn tokens, latency spikes.
Planner → Code → Synthesizer	Each phase is constrained. Planner can’t access data. Dispatch is deterministic. Synthesizer only writes from provided facts.
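
The deterministic dispatch phase can be sketched as a lookup into a fixed tool registry. The registry, the `dispatch` function, and the in-memory `deal_status` stand-in are illustrative assumptions; the plan shape (`intent`, `tool`, `params`) follows the sequence diagram above.

```python
from typing import Callable

ToolFn = Callable[[dict], dict]


def deal_status(params: dict) -> dict:
    # Hypothetical stand-in for a real database lookup.
    deals = {"acme": {"stage": "due diligence", "signed_on": "2026-03-15"}}
    return deals[params["name"].casefold()]


TOOLS: dict[str, ToolFn] = {"deal_status": deal_status}


def dispatch(plan: dict) -> dict:
    """Run the tool the planner chose; refuse anything outside the registry."""
    tool = plan.get("tool")
    if tool not in TOOLS:
        raise ValueError(f"planner requested unknown tool: {tool!r}")
    params = plan.get("params")
    if not isinstance(params, dict):
        raise ValueError("planner params must be a JSON object")
    return {"tool": tool, "result": TOOLS[tool](params)}
```

Because the planner only emits a tool name and parameters, a malformed or invented tool call fails here in ordinary code instead of reaching the data.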

Available Tools

Tool	Returns
deal_status	Stage, value, key dates
deal_parties	Companies & contacts with roles
deal_timeline	Milestones: done, pending, overdue
company_profile	Registration, industry, deals
fund_portfolio	All investments in a fund
financials_for	Revenue, EBITDA, valuations
dd_status	DD progress, findings, risk
contracts_for	Agreements: type, status, dates
whats_outstanding	Pending items, overdue milestones

Hybrid Retrieval: Vector + Lexical (RRF)

Retrieval combines two search modes via Reciprocal Rank Fusion — neither vector nor keyword search alone is sufficient.

flowchart LR
    Q["Query: 'Acme financials'"]
    V["Vector Search\nEmbed query,\ncosine similarity\nover chunks"]
    L["Lexical Search\nFull-text match\non document_type + path"]
    RRF["Reciprocal Rank\nFusion (RRF)\nscore = sum 1/(K+rank)\nK = 60"]
    R["Top-k chunks\nwith doc IDs\n+ page locators"]

    Q --> V & L
    V & L --> RRF --> R

    style RRF fill:#e0effe,stroke:#0073c5

Entity-scoped retrieval: When the planner resolves a specific entity, search is scoped to documents linked to that entity via document_subjects. Prevents leakage between deals.
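
Reciprocal Rank Fusion as described in the diagram (score = sum of 1/(K + rank), K = 60) fits in a few lines. The function name `rrf`, the chunk IDs, and the alphabetical tie-break are assumptions made for this sketch.

```python
from collections import defaultdict


def rrf(rankings: list[list[str]], k: int = 60, top_k: int = 5) -> list[tuple[str, float]]:
    """Fuse several ranked lists of chunk IDs into one scored, sorted list."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by chunk ID for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
```

A chunk that appears in both the vector and the lexical ranking accumulates two terms, so agreement between the two search modes outranks a single high position in either one.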

WhatsApp Integration

sequenceDiagram
    participant WA as WhatsApp User
    participant WAHA as WAHA Gateway
    participant API as Brain API
    participant RAG as RAG Agent
    participant DB as PostgreSQL

    WA->>WAHA: Send message
    WAHA->>API: POST /whatsapp/webhook (HMAC-SHA512)
    API->>API: Verify HMAC signature
    API->>DB: Resolve sender → known contact
    API->>DB: Check wa_operators allowlist

    alt Authorized operator
        API->>RAG: Process question
        RAG->>DB: Plan → Dispatch → Retrieve
        RAG-->>API: Grounded answer + citations
        API->>WAHA: Send reply
        WAHA->>WA: Deliver answer
    else Unknown sender
        API->>DB: Store message (no answer)
    end

Self-Hosted

WAHA runs locally. No Meta Business API approval. Data stays on our hardware. One dedicated phone number.

Group Intelligence

In group chats, the bot responds only when @mentioned. DMs are always answered (if the sender is authorized).

Media Ingestion

Documents sent via WhatsApp (PDFs, photos of contracts) are ingested into the pipeline.
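
The webhook checks in the sequence diagram can be sketched with the standard library. This assumes the gateway sends a hex-encoded HMAC-SHA512 of the raw request body; the exact header name and encoding should be taken from the WAHA documentation. `verify_webhook` and `should_answer` are hypothetical names, and the allowlist set stands in for the wa_operators table.

```python
import hashlib
import hmac


def verify_webhook(secret: bytes, raw_body: bytes, signature_hex: str) -> bool:
    """Constant-time check of an HMAC-SHA512 signature over the raw body."""
    expected = hmac.new(secret, raw_body, hashlib.sha512).hexdigest()
    received = signature_hex.strip().lower()
    return hmac.compare_digest(expected.encode("utf-8"), received.encode("utf-8"))


def should_answer(sender: str, allowlist: set[str], is_group: bool, mentioned: bool) -> bool:
    """Authorized DMs are always answered; groups only when @mentioned."""
    if sender not in allowlist:
        return False
    return mentioned if is_group else True
```

The signature must be computed over the bytes exactly as received, before any JSON parsing, since re-serialising the payload can change whitespace and break the comparison.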

User Experience

Three surfaces for interacting with the Brain: a desktop web console for operators, a mobile-responsive view for on-the-go access, and WhatsApp for instant Q&A with rich deal summaries.

[Screenshot: Desktop, Operator Dashboard]

[Screenshot: Desktop, Deal Profile]

[Screenshot: Mobile-responsive dashboard]

[Screenshot: WhatsApp Q&A with rich deal summaries]

Example Interactions

Questions via WhatsApp

“What deals are in due diligence?”

Lists all active DD deals with target companies, values, and days in DD.

“Send me the Phoenix deal summary”

Returns a formatted deal card with status, milestones, parties, and outstanding items.

“What’s outstanding across all deals?”

Aggregates overdue milestones, pending DD items, and expiring contracts across the portfolio.

“Who is the legal advisor on Greenfield?”

Returns the contact and firm with a link to the engagement letter.

Actions via WhatsApp

“Sync”

Triggers a OneDrive sync. Reports back with how many new files were found and processed.

[sends a PDF via WhatsApp]

Document is ingested immediately. Classified, extracted, and linked to the relevant deal.

“What’s blocked?”

Lists documents stuck in the pipeline with reasons (missing entity, ambiguous classification).

“Weekly digest”

Summary of new documents ingested, deals that changed status, and items needing attention.

Auth & Access Control

flowchart LR
    U["Team Members"]
    CF["Cloudflare Access"]
    E["Entra IDP\n(Azure AD)"]
    JWT["JWT Token"]
    API["Brain API"]
    WA["WhatsApp"]
    HMAC["HMAC-SHA512\nVerification"]

    U -->|"Web / API"| CF
    CF --> E
    E --> JWT
    JWT --> API
    WA -->|"Webhook"| HMAC
    HMAC --> API

    style CF fill:#f0f9ff,stroke:#0284c7
    style E fill:#e0effe,stroke:#0073c5

Web & API

Cloudflare Access with Entra IDP (Azure AD). Zero-trust. No VPN. Existing Microsoft identity. Group-based access control.

WhatsApp

Separate path. HMAC-verified webhooks. Phone number allowlist (wa_operators). No Cloudflare in this path.

Tech Stack

Layer	Technology	Why
Backend	Python 3.12 / FastAPI	Async-first. Strong LLM ecosystem. Production-grade.
Database	PostgreSQL 16 + pgvector	One database for everything: schema, vectors, FTS, job queue.
Job Queue	Procrastinate	Postgres-native. No Redis/RabbitMQ. Transactional guarantees.
Frontend	Next.js 15 / React / Tailwind	Server components. Radix UI. Fast iteration.
WhatsApp	WAHA (self-hosted)	Free core. No Meta approval. Self-hosted = data stays local.
LLM	Self-hosted (vLLM / Ollama)	All inference on local hardware. No data leaves the network. Gemma / Llama class models for classify/extract. Local embedding model for vectors.
OneDrive	Microsoft Graph SDK	Official SDK. Delta queries. Client credentials flow.
Auth	Cloudflare Access + Entra	Zero-trust. Existing Microsoft identity.
Deploy	Docker Compose / self-hosted server	Simple. Office server (Mac mini or similar). Data never leaves the building. No cloud infra.

Why Not…?

Alternative	Why Not
Cloud (AWS/GCP)	Overkill. Single team. Office server sufficient. Data stays local.
Pinecone / Weaviate	pgvector is fine for <100K chunks. One less service.
Celery / Redis	Procrastinate uses Postgres. One less service.
LangChain	Too abstract. Direct LLM calls are simpler and debuggable.
Cloud LLM APIs (OpenAI, etc.)	Deal documents are confidential. Local inference = zero data exfiltration risk. No per-token cost at scale.
Fine-tuned models	Prompt-based is sufficient. Profiles editable by operators.

Local AI Inference

All AI inference runs on a dedicated server inside the office. No document text, extracted data, or query is ever sent to an external inference provider.

flowchart LR
    subgraph office["Office Private Network"]
        app["Brain\n(Application Server)"]
        ai["AI Server\n(GPU)"]
        app -->|"classify / extract\n/ embed / OCR"| ai
        ai -->|"structured output"| app
    end

    internet["Public Internet"]
    app -.->|"OneDrive sync only\n(file download)"| internet

    style office fill:#f0fdf4,stroke:#16a34a
    style ai fill:#fef9c3,stroke:#ca8a04
    style internet fill:#fee2e2,stroke:#dc2626

Zero Data Exfiltration

Deal documents are confidential. With local inference, document text is never sent to a third-party API. Apart from the access paths (Cloudflare Access for the web console, WAHA for WhatsApp), the only outbound traffic is OneDrive file downloads and (optionally) model weight updates.

AI Server Roles

Role	Model Class	Serving	Notes
Classifier	Gemma 3 27B / Llama 3.1 8B	vLLM or Ollama	Scores documents against type profiles. ~30 types per prompt.
Extractor	Gemma 3 27B / Llama 3.1 70B	vLLM or Ollama	Per-type structured extraction. JSON mode output.
Embedder	BGE / Nomic Embed / E5	Sentence Transformers	768-dim vectors for document chunks. Batch processing.
OCR	Gemma 4 / Llama Vision	vLLM	Vision model for scanned documents and images.
RAG Synthesizer	Gemma 3 27B / Llama 3.1 70B	vLLM or Ollama	Generates grounded answers from retrieved context.

Hardware Options

recommended Mac Studio / Mac Pro (Apple Silicon)

M2 Ultra (up to 192GB unified memory) or M4 Max (up to 128GB). Runs 70B models comfortably via MLX or Ollama. Silent. Low power. Fits on a shelf.

Unified memory = no GPU VRAM bottleneck. Entire model in memory.

alternative Linux Server + NVIDIA GPU

RTX 4090 (24GB) or A6000 (48GB). vLLM with CUDA. Higher throughput for batch workloads. Standard MLOps tooling.

Requires active cooling. Higher power draw. Better batch throughput.

Privacy

No API calls. No data leaving the network. Full control over model versions and behavior.

Cost

One-time hardware investment. No per-token charges. Thousands of documents processed for the cost of electricity.

Flexibility

Swap models freely. Test new releases same day. No vendor lock-in. OpenAI-compatible API (vLLM/Ollama) means the application code doesn’t change.

Key Design Decisions

1. Content-Hash Document Identity

Decision: SHA-256 of file bytes is the unique identifier, not filename or path.

Why: Files get renamed, moved, duplicated. “Term Sheet v2 FINAL (2).pdf” is the same document as “Term Sheet v2 FINAL.pdf” if the bytes are identical. Hash identity = zero wasted reprocessing on renames, automatic deduplication.

2. Watch-and-Surface, Operator-Decides

Decision: The system proposes, never acts unilaterally on uncertain data.

Why: In PE/VC, linking the wrong entity to a deal has real consequences. Three operator queues:

Blocked — preconditions not met (self-healing when resolved)
Proposals — uncertain entity resolutions needing human judgement
Errors — pipeline failures needing technical attention

3. Markdown Classifier Profiles (Not Code)

Decision: Document type signals described in markdown files, not Python code.

Why: Operators can edit profiles without touching code. Adding a new document type is a 30-minute task. Edits auto-trigger reclassification. Fastest path from “we found a new doc type” to “the system handles it.”

4. Stratified Fixpoint (Not Retry Queues)

Decision: One mechanism (fixpoint loop) handles all temporal dependencies.

Why: Retry queues are ad hoc: per-type retry logic, dead-letter queues, manual reprocessing. The fixpoint is one mechanism that covers out-of-order docs, missing references, and cascading creation. Self-healing. Convergence is guaranteed because each pass only adds state and the number of passes is capped.

5. Three-Phase RAG (Not Single-Call)

Decision: Separate planning, data retrieval, and answer synthesis into three phases.

Why: Single-call RAG hallucinates. It invents data that sounds right but isn’t in the documents. Three phases: planner can’t access data, dispatch is deterministic code, synthesizer only writes from provided facts. Citations are verifiable.
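
Citation verifiability can be checked mechanically after synthesis. This sketch assumes citations use the `[doc:<id>]` form shown in the RAG example and that IDs are lowercase hex; `unsupported_citations` is a hypothetical helper name.

```python
import re

# Matches markers such as [doc:a1b2c3]; the hex-only ID shape is an assumption.
CITATION = re.compile(r"\[doc:([0-9a-f]+)\]")


def unsupported_citations(answer: str, provided_doc_ids: set[str]) -> list[str]:
    """Return cited document IDs that were not among the retrieved documents."""
    return [c for c in CITATION.findall(answer) if c not in provided_doc_ids]
```

An answer whose list of unsupported citations is non-empty could be rejected or regenerated before it is sent to the user.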

6. OneDrive as Source of Truth

Decision: OneDrive is canonical. Brain is a read-only derivative.

Why: The team already works in OneDrive. Asking them to upload to a separate system = friction that kills adoption. Brain watches their existing workflow — the “ghost in the machine” principle.

7. Office-Hosted on a Private Network

Decision: Self-hosted on a server inside the office, on the private network. The only external paths are OneDrive sync, Cloudflare Access for the web console, and the WhatsApp gateway.

Why: Deal documents are sensitive. The server (Mac mini or similar) lives inside the office on the private network. Data never leaves the building. Docker Compose keeps operations simple. Architecture ports to any Docker host if scale demands change.

8. Local AI Inference (Not Cloud APIs)

Decision: All LLM inference runs on a dedicated server inside the office. No document text sent to external APIs.

Why: Deal documents are confidential. Cloud LLM APIs mean every document, every query, every extracted fact transits a third party’s infrastructure. Local inference eliminates that risk entirely. One-time hardware cost replaces ongoing per-token charges. Models are swappable without code changes (OpenAI-compatible serving via vLLM/Ollama).

Implementation Phases

Phase 0: Foundation + Discovery

Weeks 1–2

Stand up infrastructure. Explore the corpus. Design entity schema from evidence.

• Repo scaffolding, Docker Compose, Alembic

• Postgres + pgvector + parser sidecar

• OneDrive sync (authenticate, full download)

• Exploration CLI — parse samples, discover types

• Key output: Report on document types & entities found. Schema designed from evidence.

Phase 1: Core Pipeline

Weeks 3–5

Ingest the full corpus through the pipeline.

• Schema migration (typed tables from Phase 0)

• 10–15 classifier profiles

• 5–8 extractors for structured doc types

• 3–5 appliers for state-creating types

• Full pipeline wired: discover → apply

• Delta query sync for incremental updates

Phase 2: RAG + WhatsApp

Weeks 6–7

The team can ask the brain questions via WhatsApp.

• Document chunk embeddings

• 12 deals-domain RAG tools

• RAG answer agent (3-phase)

• WAHA setup + WhatsApp bot

🎯 First value delivery: team asks questions via WhatsApp and gets grounded, cited answers about any deal.

Phase 3: Operator Console

Week 8+

Web UI for reviewing pipeline output and browsing entities.

• Dashboard: deals, companies, pipeline health

• Review queues: blocked / proposals / errors

• Entity browsers: deal, company, contact profiles

• Document viewer with audit trail

Risk & Mitigation

Risk	Impact	Mitigation
OneDrive API issues	Sync breaks	Persisted delta token resumes after failures. Full-sync fallback. Content hash = cheap reprocessing.
LLM classification accuracy	Wrong types → wrong extraction	Threshold gate. Operator triage queue. Adjudicators for common confusions.
Entity resolution ambiguity	Wrong entities linked	Never-guess principle. Proposals queue. Canonical IDs where available.
Sensitive data exposure	Deal data leaked	Office-hosted server on private network. All inference local. Cloudflare + Entra auth. HMAC webhooks.
Server failure	System down	All state lives in Docker volumes. Back up with pg_dump and restore on any Docker host.
Corpus too large	First ingest takes days	8–24 parallel docs. Content hash skip. Process incrementally by folder.
Schema wrong	Rework needed	Phase 0 discovery = schema from evidence, not speculation. Migrations evolve.