System Design Proposal

Karatage Brain

A knowledge system for PE/VC deal operations.
Ingest documents. Build structured state. Answer questions.

Author: Luke Venediger June 2026 Pending CTO Review

The Problem

Karatage's deal knowledge is trapped in a large OneDrive folder structure. Thousands of documents — term sheets, financial statements, board resolutions, DD reports, NDAs — sit in company-grouped folders. Getting an answer means digging through folders manually.

What’s Missing

  • No single source of truth across deals
  • No way to ask questions without manual search
  • No structured view of deal pipelines
  • No audit trail of how information was derived

What We Want

  • System watches OneDrive automatically
  • Reads & understands every document
  • Builds a queryable knowledge base
  • Answers questions via WhatsApp

Solution Overview

Karatage Brain is purpose-built for PE/VC deal operations. It watches your existing OneDrive workflow, reads every document, builds structured state, and lets the team query it naturally via WhatsApp or a web console.

flowchart LR
    OD["OneDrive\n(Source of Truth)"]
    Mirror["Local Mirror"]
    Pipeline["Pipeline\n6 Stages"]
    DB["PostgreSQL\n+ pgvector\n(The Brain)"]
    WA["WhatsApp"]
    Web["Web Console"]
    CLI["CLI"]

    OD -->|"Microsoft Graph\nDelta Sync"| Mirror
    Mirror -->|"6-stage pipeline"| Pipeline
    Pipeline --> DB
    DB --> WA
    DB --> Web
    DB --> CLI

    style OD fill:#e0effe,stroke:#0073c5,color:#064e84
    style DB fill:#f0fdf4,stroke:#16a34a,color:#166534
    style Pipeline fill:#fef9c3,stroke:#ca8a04,color:#854d0e
          

Core Principle

Watch-and-surface, operator-decides. The system reads documents and proposes structured state. It never silently fabricates data. When it’s uncertain, it surfaces a proposal for a human to resolve.

Architecture

graph TB
    subgraph office["Office (Private Network)"]
        subgraph app["Application Server"]
            subgraph docker["Docker Compose"]
                pg["PostgreSQL 16\n+ pgvector\n+ pg_trgm"]
                parser["Parser\nSidecar"]
                api["API\n(FastAPI)"]
                worker["Worker\n(Procrastinate)"]
                web["Web UI\n(Next.js)"]
                waha["WAHA\n(WhatsApp)"]
            end
            mirror["OneDrive Mirror"]
            cli["brain CLI"]
        end
        ai["AI Server\n(vLLM / Ollama)\nLocal GPU inference"]
        worker --> ai
        api --> ai
    end

    onedrive["OneDrive (M365)"]
    entra["Entra IDP"]
    cf["Cloudflare Access"]
    users["Team"]

    onedrive -->|"Graph API\nDelta Queries"| mirror
    mirror --> worker
    cli --> pg
    api --> pg
    worker --> pg
    worker --> parser
    api --> parser
    web --> api
    waha --> api

    users --> cf
    cf --> entra
    cf --> web

    style pg fill:#d1fae5,stroke:#059669
    style ai fill:#fef9c3,stroke:#ca8a04
    style docker fill:#f0f9ff,stroke:#0284c7
    style app fill:#fafafa,stroke:#d4d4d4
    style office fill:#f0fdf4,stroke:#16a34a
          

Service Responsibilities

🗄️

PostgreSQL

All state. Typed schema + pgvector embeddings + full-text search. Single database, no external deps.

📄

Parser Sidecar

Stateless. Converts bytes → text. PDF, Office, images. Vision LLM for OCR on scanned docs.

API (FastAPI)

HTTP API. Serves web UI, WhatsApp webhook, RAG queries. Enqueues async jobs.

⚙️

Worker

Procrastinate async jobs. Pipeline stages, OneDrive sync, embedding backfills.

🌐

Web (Next.js)

Operator console. Dashboard, entity browsers, pipeline queues, ask-the-brain.

📱

WAHA

Self-hosted WhatsApp gateway. One dedicated number. Bridges WhatsApp ↔ HTTP.

🧠

AI Server

Dedicated GPU hardware on the private network. Serves all LLM inference (classify, extract, embed, OCR, RAG). OpenAI-compatible API.

The Ingestion Pipeline

Six stages transform unstructured documents into typed, queryable state. Each stage is independently cached and re-runnable.

flowchart TD
    mirror["OneDrive Mirror"] --> discover

    discover["DISCOVER\nSHA-256 hash = identity\nNew hash? Create row.\nKnown hash? Skip."]
    parse["PARSE\nParser sidecar: PDF, DOCX,\nXLSX, images to text chunks\nwith page locators"]
    classify["CLASSIFY\nLLM scores document against\nall known types.\nThreshold: min 0.80, margin min 0.20"]
    adjudicate["ADJUDICATE\nPair-specific LLM call\nto disambiguate"]
    extract["EXTRACT\nPer-type structured extraction\nLLM to Pydantic model to JSON\nwith per-field confidence"]
    resolve["RESOLVE\nMap names/IDs to canonical\ndatabase entities.\nAmbiguous = proposal"]
    apply["APPLY\nStratified fixpoint loop.\nPreconditions + effects.\nLoop until convergence."]
    unknown["UNKNOWN\nOperator triage queue"]

    discover --> parse --> classify
    classify -->|"Confident"| extract
    classify -->|"Below threshold"| unknown
    classify -->|"Margin too narrow"| adjudicate
    adjudicate -->|"Resolved"| extract
    adjudicate -->|"Still unclear"| unknown
    extract --> resolve --> apply

    style discover fill:#e0effe,stroke:#0073c5
    style parse fill:#e0effe,stroke:#0073c5
    style classify fill:#fef9c3,stroke:#ca8a04
    style adjudicate fill:#fef3c7,stroke:#d97706
    style extract fill:#d1fae5,stroke:#059669
    style resolve fill:#d1fae5,stroke:#059669
    style apply fill:#d1fae5,stroke:#059669
    style unknown fill:#fee2e2,stroke:#dc2626
          

Stage Caching (Make-Style Invalidation)

Each stage writes output to columnar caches on the documents row. Re-runs only when the version changes.

StageCached ColumnsVersion KeyInvalidation Trigger
Parseparsed_at, parser_version, parse_outputPARSER_VERSION constantBump the constant
Classifyclassified_at, classifier_version, document_typeSHA-256 of all classifier profilesEdit any profile
Extractextracted_at, extractor_version, extraction_resultSHA-256 of per-type profileEdit the type’s profile
Applyapplication_status(none — runs every eligible doc)Every fixpoint pass

Document Identity: SHA-256 Content Hash

Every document is identified by the SHA-256 hash of its file bytes — not its filename or path. A renamed file is the same document. A file copied to a different folder is the same document. Weekly full syncs skip all unchanged files. Deduplication is automatic.

OneDrive Integration

sequenceDiagram
    participant Op as Operator
    participant CLI as brain CLI
    participant Graph as Microsoft Graph API
    participant Mirror as Local Mirror
    participant DB as PostgreSQL
    participant Pipeline as Pipeline

    Op->>CLI: brain sync
    CLI->>DB: Load delta token
    DB-->>CLI: token (or null for first sync)

    alt First sync (no token)
        CLI->>Graph: GET /delta (full tree)
        Graph-->>CLI: All files + deltaLink
    else Incremental sync
        CLI->>Graph: GET {deltaLink}
        Graph-->>CLI: Changed files + new deltaLink
    end

    loop Each changed file
        CLI->>Graph: Download file
        Graph-->>CLI: File bytes
        CLI->>Mirror: Write to mirror/
    end

    CLI->>DB: Store new delta token
    CLI->>Pipeline: Trigger ingestion on changed files
    Pipeline->>DB: discover, parse, classify, ...
          

Why Manual Trigger?

The OneDrive corpus changes slowly — a few documents per week, not per minute. Continuous polling would be complexity without payoff.

  • Operator runs brain sync from CLI or WhatsApp
  • Delta queries ensure only changed files download
  • Content hash catches anything delta missed
  • Can upgrade to webhook-driven sync later

Resilience

  • 🛡️ Crash recovery: Delta token persists. Next run picks up where it left off.
  • 🛡️ Download failure: File skipped, retried next cycle.
  • 🛡️ Token expiry (~90 days): Falls back to full sync. Content hash prevents wasted reprocessing.
  • 🛡️ OneDrive is source of truth: Local mirror is read-only derivative.

Entity Model

The typed schema captures the PE/VC deal domain. Core entities are linked by relationship tables that model the real-world connections.

erDiagram
    FUND ||--o{ INVESTMENT : "invests via"
    DEAL ||--o{ INVESTMENT : "facilitates"
    COMPANY ||--o{ INVESTMENT : "receives"

    FUND {
        uuid id PK
        string name
        int vintage_year
        numeric fund_size
        string status
    }

    DEAL ||--o{ DEAL_PARTY : "involves"
    DEAL ||--o{ DEAL_MILESTONE : "tracks"
    DEAL ||--o{ CONTRACT : "has"
    DEAL ||--o{ DD_ITEM : "requires"
    DEAL }o--|| COMPANY : "targets"

    DEAL {
        uuid id PK
        string name
        string deal_type
        string status
        numeric deal_value
        date signed_on
    }

    COMPANY ||--o{ FINANCIAL_DATA : "reports"
    COMPANY ||--o{ DEAL_PARTY : "participates"
    CONTACT ||--o{ DEAL_PARTY : "participates"
    CONTACT }o--o| COMPANY : "works at"

    COMPANY {
        uuid id PK
        string name
        string registration_number
        string country
        string industry
    }

    CONTACT {
        uuid id PK
        string full_name
        string email
        string phone
        string role_type
    }
          

Canonical Identity Challenge

Unlike regulatory compliance (where entities have government-issued IDs), the PE/VC domain has weaker identifiers:

EntityStrong IDFallbackResolution Strategy
CompanyRegistration numberName + countryOperator confirms via proposal
ContactEmail addressName + company affiliationOperator confirms via proposal
DealOperator-assigned slugFolder name as strong hint
FundOperator-assigned nameExact match only

The system never guesses identity. When it can’t match with confidence, it creates a proposal for the operator to resolve. This prevents the worst outcome: silently linking the wrong entities.

Discovery-First Schema Design

This entity model is hypothesised. Phase 0 includes a deliberate discovery step: explore the actual OneDrive corpus, classify sample documents, identify what entities and document types exist, and design the schema from evidence — not speculation.

Classification & Extraction

Classification

The classifier determines what kind of document each file is. It uses markdown profiles — one per document type — assembled into a single LLM prompt.

# term_sheet.md

What it is: Binding or non-binding offer

Signals: “term sheet”, “indicative offer”, purchase price, conditions precedent

Distinguish from: LOI, SPA, MOU

Decision Gate

Classified

Score ≥ 0.80 AND gap to 2nd ≥ 0.20

⚖️

Adjudicator

Margin too narrow → pair-specific disambiguation

Unknown

Below threshold → operator triage queue

The Adjudicator — How It Resolves Close Calls

When the classifier can’t decide between two types (both score high, but the gap is too narrow), a specialised adjudicator fires for that specific pair.

flowchart LR
    C["Classifier Output\nbank_confirmation: 0.82\nproof_of_address: 0.75\nGap: 0.07 < 0.20"]
    A["Adjudicator\nbank_vs_proof_of_address\n\n'Does this confirm banking\ndetails or merely an address?'"]
    R["Resolved:\nbank_confirmation (0.91)"]

    C -->|"Margin fail"| A -->|"Sharp question\nsharp answer"| R

    style A fill:#fef3c7,stroke:#d97706
    style R fill:#d1fae5,stroke:#059669
              

Why not just make the classifier better? The classifier sees ~30 types simultaneously. Adjudicators see exactly two. Sharper question → sharper answer. Cheaper and more reliable than one mega-prompt handling every pairwise confusion.

Four Document-Type Shapes

Not every document needs full extraction. Four levels of processing:

ignore Classify only

Marketing brochure, duplicate cover letter. Not worth processing.

link-only Classify + extract IDs + link

NDA (link to company), CV (link to contact). Shows on entity profiles. 30 min to add.

capture-only Full extract, no mutations

DD report (findings stored as JSON, not typed rows). Visible + searchable, no schema change.

fully typed Extract + resolve + apply

Term sheet → creates Deal + Contract + Milestones. Full pipeline. ~1 day to add.

The Stratified Fixpoint

The Problem: Out-of-Order Documents

Documents arrive in unpredictable order. A board resolution approving a deal might be ingested before the term sheet that creates the deal record. Naïve approaches (ordered processing, retry queues) add complexity and don’t converge.

The Solution

Borrowed from Datalog evaluation. Each applier declares preconditions and effects. The apply stage runs in a loop until convergence.

flowchart TD
    subgraph R1["Round 1"]
        ts1["Term Sheet\nPrecondition: none\nCreates Deal 'Acme'\nCreates Company 'Acme Corp'"]
        br1["Board Resolution\nPrecondition: Deal exists\nDeal doesn't exist yet\nStatus: PENDING"]
    end

    subgraph R2["Round 2"]
        br2["Board Resolution\nPrecondition: Deal exists\nDeal now exists!\nCreates Milestone 'board_approval'"]
    end

    subgraph R3["Round 3"]
        conv["No changes = CONVERGED"]
    end

    R1 --> R2 --> R3

    style ts1 fill:#d1fae5,stroke:#059669
    style br1 fill:#fef3c7,stroke:#d97706
    style br2 fill:#d1fae5,stroke:#059669
    style conv fill:#e0effe,stroke:#0073c5
          

Self-Healing

New documents that satisfy blocked preconditions automatically unlock blocked docs in the next pass. No manual intervention.

Convergence

Max N passes (default 10). Each pass only adds state (monotonic). No changes = converged. Still pending after convergence = blocked.

Status Machine

extracted → pending → applied | blocked | rejected

RAG — Answering Questions

sequenceDiagram
    participant U as User (WhatsApp/Web)
    participant P as Phase 1: PLANNER
    participant D as Phase 2: DISPATCH
    participant S as Phase 3: SYNTHESIZER

    U->>P: "What's the status of the Acme deal?"

    Note over P: LLM call #1 (JSON mode)
    P->>P: Maps to tool: deal_status
    P->>P: Params: {kind: "deal", name: "Acme"}

    P->>D: {intent, tool, params}

    Note over D: Deterministic code (no LLM)
    D->>D: Resolve "Acme" → Deal ID
    D->>D: Run deal_status tool → structured data
    D->>D: Retrieve top-k chunks (vector + lexical)

    D->>S: tool_results + retrieved_chunks

    Note over S: LLM call #2 (plain text)
    S->>S: Write grounded answer
    S->>S: Cite only provided documents

    S->>U: "The Acme acquisition is in due diligence.\nTerm sheet signed 2026-03-15 for R50M.\n[doc:a1b2c3]"
          

Why Three Phases?

ApproachProblem
Single LLM callCan’t do structured lookups. Hallucinates data. No verifiable citations.
LLM + function callingModel picks wrong tools, retries burn tokens, latency spikes.
Planner → Code → SynthesizerEach phase is constrained. Planner can’t access data. Dispatch is deterministic. Synthesizer only writes from provided facts.

Available Tools

deal_status

Stage, value, key dates

deal_parties

Companies & contacts with roles

deal_timeline

Milestones: done, pending, overdue

company_profile

Registration, industry, deals

fund_portfolio

All investments in a fund

financials_for

Revenue, EBITDA, valuations

dd_status

DD progress, findings, risk

contracts_for

Agreements: type, status, dates

whats_outstanding

Pending items, overdue milestones

Hybrid Retrieval: Vector + Lexical (RRF)

Retrieval combines two search modes via Reciprocal Rank Fusion — neither vector nor keyword search alone is sufficient.

flowchart LR
    Q["Query: 'Acme financials'"]
    V["Vector Search\nEmbed query,\ncosine similarity\nover chunks"]
    L["Lexical Search\nFull-text match\non document_type + path"]
    RRF["Reciprocal Rank\nFusion (RRF)\nscore = sum 1/(K+rank)\nK = 60"]
    R["Top-k chunks\nwith doc IDs\n+ page locators"]

    Q --> V & L
    V & L --> RRF --> R

    style RRF fill:#e0effe,stroke:#0073c5
              

Entity-scoped retrieval: When the planner resolves a specific entity, search is scoped to documents linked to that entity via document_subjects. Prevents leakage between deals.

WhatsApp Integration

sequenceDiagram
    participant WA as WhatsApp User
    participant WAHA as WAHA Gateway
    participant API as Brain API
    participant RAG as RAG Agent
    participant DB as PostgreSQL

    WA->>WAHA: Send message
    WAHA->>API: POST /whatsapp/webhook (HMAC-SHA512)
    API->>API: Verify HMAC signature
    API->>DB: Resolve sender → known contact
    API->>DB: Check wa_operators allowlist

    alt Authorized operator
        API->>RAG: Process question
        RAG->>DB: Plan → Dispatch → Retrieve
        RAG-->>API: Grounded answer + citations
        API->>WAHA: Send reply
        WAHA->>WA: Deliver answer
    else Unknown sender
        API->>DB: Store message (no answer)
    end
          

Self-Hosted

WAHA runs locally. No Meta Business API approval. Data stays on our hardware. One dedicated phone number.

Group Intelligence

In group chats, bot only responds when @mentioned. DMs always answered (if authorized).

Media Ingestion

Documents sent via WhatsApp (PDFs, photos of contracts) are ingested into the pipeline.

User Experience

Three surfaces for interacting with the Brain: a desktop web console for operators, a mobile-responsive view for on-the-go access, and WhatsApp for instant Q&A with rich deal summaries.

Desktop — Operator Dashboard

brain.karatage.io Karatage Brain Dashboard Deals Companies Contacts Funds PIPELINE Documents Blocked 3 Proposals 7 Errors TOOLS Ask the Brain Sync Status Dashboard Search deals, companies, docs... Active Deals 12 +2 this month Documents Ingested 2,847 Pending Review 10 proposals + blocked Last Sync 2h ago 14 new files Deal Pipeline Sourcing 5 Due Diligence 3 IC Approved 2 Closing 2 Recent Deals DEAL NAME COMPANY STATUS VALUE LAST ACTIVITY Project Phoenix TechVentures Ltd IC Approved R 45M 2 hours ago Greenfield Acquisition AgriCorp Holdings Due Diligence R 120M Yesterday Solar Power Fund II Helios Energy Sourcing R 80M 3 days ago MedTech Series B NovaBio Diagnostics Closing R 35M 1 week ago Logistics Platform FreightLink SA Due Diligence R 65M 2 weeks ago

Desktop — Deal Profile

brain.karatage.io/deals/project-phoenix K Deals / Project Phoenix Project Phoenix IC Approved TechVentures Ltd · Acquisition · Fund: Growth Fund III Overview Documents Timeline Financials DD Items Audit Trail DEAL DETAILS Deal Value R 45,000,000 Type Acquisition (100%) Sourced 12 Jan 2026 IC Date 18 Apr 2026 Instrument Equity KEY PARTIES Buyer Growth Fund III Target TechVentures Ltd Legal Advisor Webber Wentzel Lead Contact Sarah Chen DOCUMENTS 23 linked Term Sheet (signed) term_sheet · 2 Mar 2026 Milestones NDA Signed 15 Jan 2026 DD Commenced 2 Feb 2026 Term Sheet Signed 2 Mar 2026 IC Approval 18 Apr 2026 SPA Signing Due 15 Jun 2026 Closing Ask about this deal Ask a question about Project Phoenix... What's outstanding on this deal? SPA signing due 15 Jun. Legal review of warranties in progress. BEE cert pending. [doc:a3f2c1] [doc:b7e4d2]

Mobile & WhatsApp

Mobile-responsive dashboard

9:41 Karatage Brain Sync Search deals, companies... Active Deals 12 Pending 10 Recent Deals Project Phoenix TechVentures Ltd · R 45M IC Approved 2h ago Greenfield Acquisition AgriCorp Holdings · R 120M Due Diligence Yesterday Solar Power Fund II Helios Energy · R 80M Sourcing 3 days Deals Companies Pipeline Ask

WhatsApp Q&A with rich deal summaries

KB Karatage Brain online What's the status of Phoenix? Project Phoenix IC approved on 18 Apr 2026. SPA signing due 15 Jun. Legal review of warranties in progress. BEE cert outstanding. Sources: term_sheet.pdf, ic_memo.pdf Give me a deal summary for Mark DEAL SUMMARY Project Phoenix TARGET TechVentures Ltd VALUE R 45M STATUS IC Approved FUND Growth Fund III NEXT MILESTONE SPA Signing — due 15 Jun 2026 OUTSTANDING • Legal warranty review • BEE certificate Message

Example Interactions

Questions via WhatsApp

>

“What deals are in due diligence?”

Lists all active DD deals with target companies, values, and days in DD.

>

“Send me the Phoenix deal summary”

Returns a formatted deal card with status, milestones, parties, and outstanding items.

>

“What’s outstanding across all deals?”

Aggregates overdue milestones, pending DD items, and expiring contracts across the portfolio.

>

“Who is the legal advisor on Greenfield?”

Returns the contact and firm with a link to the engagement letter.

Actions via WhatsApp

>

“Sync”

Triggers an OneDrive sync. Reports back with how many new files were found and processed.

>

[sends a PDF via WhatsApp]

Document is ingested immediately. Classified, extracted, and linked to the relevant deal.

>

“What’s blocked?”

Lists documents stuck in the pipeline with reasons (missing entity, ambiguous classification).

>

“Weekly digest”

Summary of new documents ingested, deals that changed status, and items needing attention.

Auth & Access Control

flowchart LR
    U["Team Members"]
    CF["Cloudflare Access"]
    E["Entra IDP\n(Azure AD)"]
    JWT["JWT Token"]
    API["Brain API"]
    WA["WhatsApp"]
    HMAC["HMAC-SHA512\nVerification"]

    U -->|"Web / API"| CF
    CF --> E
    E --> JWT
    JWT --> API
    WA -->|"Webhook"| HMAC
    HMAC --> API

    style CF fill:#f0f9ff,stroke:#0284c7
    style E fill:#e0effe,stroke:#0073c5
          

Web & API

Cloudflare Access with Entra IDP (Azure AD). Zero-trust. No VPN. Existing Microsoft identity. Group-based access control.

WhatsApp

Separate path. HMAC-verified webhooks. Phone number allowlist (wa_operators). No Cloudflare in this path.

Tech Stack

LayerTechnologyWhy
BackendPython 3.12 / FastAPIAsync-first. Strong LLM ecosystem. Production-grade.
DatabasePostgreSQL 16 + pgvectorOne database for everything: schema, vectors, FTS, job queue.
Job QueueProcrastinatePostgres-native. No Redis/RabbitMQ. Transactional guarantees.
FrontendNext.js 15 / React / TailwindServer components. Radix UI. Fast iteration.
WhatsAppWAHA (self-hosted)Free core. No Meta approval. Self-hosted = data stays local.
LLMSelf-hosted (vLLM / Ollama)All inference on local hardware. No data leaves the network. Gemma / Llama class models for classify/extract. Local embedding model for vectors.
OneDriveMicrosoft Graph SDKOfficial SDK. Delta queries. Client credentials flow.
AuthCloudflare Access + EntraZero-trust. Existing Microsoft identity.
DeployDocker Compose / self-hosted serverSimple. Office server (Mac mini or similar). Data never leaves the building. No cloud infra.
Why Not…?
AlternativeWhy Not
Cloud (AWS/GCP)Overkill. Single team. Office server sufficient. Data stays local.
Pinecone / Weaviatepgvector is fine for <100K chunks. One less service.
Celery / RedisProcrastinate uses Postgres. One less service.
LangChainToo abstract. Direct LLM calls are simpler and debuggable.
Cloud LLM APIs (OpenAI, etc.)Deal documents are confidential. Local inference = zero data exfiltration risk. No per-token cost at scale.
Fine-tuned modelsPrompt-based is sufficient. Profiles editable by operators.

Local AI Inference

All AI inference runs on a dedicated server inside the office. No document text, no extracted data, and no queries ever leave the private network.

flowchart LR
    subgraph office["Office Private Network"]
        app["Brain\n(Application Server)"]
        ai["AI Server\n(GPU)"]
        app -->|"classify / extract\n/ embed / OCR"| ai
        ai -->|"structured output"| app
    end

    internet["Public Internet"]
    app -.->|"OneDrive sync only\n(file download)"| internet

    style office fill:#f0fdf4,stroke:#16a34a
    style ai fill:#fef9c3,stroke:#ca8a04
    style internet fill:#fee2e2,stroke:#dc2626
          

Zero Data Exfiltration

Deal documents are confidential. With local inference, document text is never sent to a third-party API. The only outbound traffic is OneDrive file downloads and (optionally) model weight updates.

AI Server Roles

RoleModel ClassServingNotes
ClassifierGemma 3 27B / Llama 3.1 8BvLLM or OllamaScores documents against type profiles. ~30 types per prompt.
ExtractorGemma 3 27B / Llama 3.1 70BvLLM or OllamaPer-type structured extraction. JSON mode output.
EmbedderBGE / Nomic Embed / E5Sentence Transformers768-dim vectors for document chunks. Batch processing.
OCRGemma 4 / Llama VisionvLLMVision model for scanned documents and images.
RAG SynthesizerGemma 3 27B / Llama 3.1 70BvLLM or OllamaGenerates grounded answers from retrieved context.

Hardware Options

recommended Mac Studio / Mac Pro (Apple Silicon)

M2 Ultra or M4 Max with 192GB+ unified memory. Runs 70B models comfortably via MLX or Ollama. Silent. Low power. Fits on a shelf.

Unified memory = no GPU VRAM bottleneck. Entire model in memory.

alternative Linux Server + NVIDIA GPU

RTX 4090 (24GB) or A6000 (48GB). vLLM with CUDA. Higher throughput for batch workloads. Standard MLOps tooling.

Requires active cooling. Higher power draw. Better batch throughput.

Privacy

No API calls. No data leaving the network. Full control over model versions and behavior.

Cost

One-time hardware investment. No per-token charges. Thousands of documents processed for the cost of electricity.

Flexibility

Swap models freely. Test new releases same day. No vendor lock-in. OpenAI-compatible API (vLLM/Ollama) means the application code doesn’t change.

Key Design Decisions

1. Content-Hash Document Identity

Decision: SHA-256 of file bytes is the unique identifier, not filename or path.

Why: Files get renamed, moved, duplicated. “Term Sheet v2 FINAL (2).pdf” is the same document as “Term Sheet v2 FINAL.pdf” if the bytes are identical. Hash identity = zero wasted reprocessing on renames, automatic deduplication.

2. Watch-and-Surface, Operator-Decides

Decision: The system proposes, never acts unilaterally on uncertain data.

Why: In PE/VC, linking the wrong entity to a deal has real consequences. Three operator queues:

  • Blocked — preconditions not met (self-healing when resolved)
  • Proposals — uncertain entity resolutions needing human judgement
  • Errors — pipeline failures needing technical attention
3. Markdown Classifier Profiles (Not Code)

Decision: Document type signals described in markdown files, not Python code.

Why: Operators can edit profiles without touching code. Adding a new document type is a 30-minute task. Edits auto-trigger reclassification. Fastest path from “we found a new doc type” to “the system handles it.”

4. Stratified Fixpoint (Not Retry Queues)

Decision: One mechanism (fixpoint loop) handles all temporal dependencies.

Why: Retry queues are ad-hoc — per-type retry logic, dead-letter queues, manual reprocessing. The fixpoint is one mechanism: out-of-order docs, missing references, cascading creation. Self-healing. Convergence guaranteed.

5. Three-Phase RAG (Not Single-Call)

Decision: Separate planning, data retrieval, and answer synthesis into three phases.

Why: Single-call RAG hallucinates. It invents data that sounds right but isn’t in the documents. Three phases: planner can’t access data, dispatch is deterministic code, synthesizer only writes from provided facts. Citations are verifiable.

6. OneDrive as Source of Truth

Decision: OneDrive is canonical. Brain is a read-only derivative.

Why: The team already works in OneDrive. Asking them to upload to a separate system = friction that kills adoption. Brain watches their existing workflow — the “ghost in the machine” principle.

7. Office-Hosted, Air-Gapped from the Internet

Decision: Self-hosted on a server inside the office, isolated from the public internet.

Why: Deal documents are sensitive. The server (Mac mini or similar) lives inside the office on the private network. Data never leaves the building. Docker Compose keeps operations simple. Architecture ports to any Docker host if scale demands change.

8. Local AI Inference (Not Cloud APIs)

Decision: All LLM inference runs on a dedicated server inside the office. No document text sent to external APIs.

Why: Deal documents are confidential. Cloud LLM APIs mean every document, every query, every extracted fact transits a third party’s infrastructure. Local inference eliminates that risk entirely. One-time hardware cost replaces ongoing per-token charges. Models are swappable without code changes (OpenAI-compatible serving via vLLM/Ollama).

Implementation Phases

Foundation + Discovery

Weeks 1–2

Stand up infrastructure. Explore the corpus. Design entity schema from evidence.

Repo scaffolding, Docker Compose, Alembic
Postgres + pgvector + parser sidecar
OneDrive sync (authenticate, full download)
Exploration CLI — parse samples, discover types
Key output: Report on document types & entities found. Schema designed from evidence.

Core Pipeline

Weeks 3–5

Ingest the full corpus through the pipeline.

Schema migration (typed tables from Phase 0)
10–15 classifier profiles
5–8 extractors for structured doc types
3–5 appliers for state-creating types
Full pipeline wired: discover → apply
Delta query sync for incremental updates

RAG + WhatsApp

Weeks 6–7

The team can ask the brain questions via WhatsApp.

Document chunk embeddings
12 deals-domain RAG tools
RAG answer agent (3-phase)
WAHA setup + WhatsApp bot

🎯 First value delivery: team asks questions via WhatsApp and gets grounded, cited answers about any deal.

Operator Console

Week 8+

Web UI for reviewing pipeline output and browsing entities.

Dashboard: deals, companies, pipeline health
Review queues: blocked / proposals / errors
Entity browsers: deal, company, contact profiles
Document viewer with audit trail

Risk & Mitigation

RiskImpactMitigation
OneDrive API issuesSync breaksDelta tokens resilient. Full-sync fallback. Content hash = cheap reprocessing.
LLM classification accuracyWrong types → wrong extractionThreshold gate. Operator triage queue. Adjudicators for common confusions.
Entity resolution ambiguityWrong entities linkedNever-guess principle. Proposals queue. Canonical IDs where available.
Sensitive data exposureDeal data leakedOffice-hosted server on private network. All inference local. Cloudflare + Entra auth. HMAC webhooks.
Server failureSystem downDocker volumes are only state. Backup + restore on any Docker host. pg_dump.
Corpus too largeFirst ingest takes days8–24 parallel docs. Content hash skip. Process incrementally by folder.
Schema wrongRework neededPhase 0 discovery = schema from evidence, not speculation. Migrations evolve.