Knowledge Discovery Pipeline
How TerraGuard automatically discovers, crawls, validates, and indexes external knowledge sources to build a comprehensive information base for each disaster event.
Overview
The Knowledge Discovery Pipeline is responsible for building a rich, up-to-date knowledge base for every disaster event. It automatically searches the web for relevant content, crawls and extracts text from discovered pages, validates relevance using an LLM, and indexes the content as vector embeddings for semantic retrieval.
This pipeline runs as a set of Inngest functions triggered after event ingestion. It does not run just once -- it re-runs on a schedule to discover new content as events develop over time.
Pipeline Steps
Step 1: Query Generation
The orchestrator generates multiple search queries from the event metadata. Rather than issuing a single generic query, it produces targeted queries that cover different aspects of the event:
| Query Type | Example (M7.2 Earthquake, Turkey) |
|---|---|
| Primary | "Turkey earthquake magnitude 7.2" |
| Location-specific | "Kahramanmaras earthquake damage" |
| Impact-focused | "Turkey earthquake casualties displacement" |
| Response-focused | "Turkey earthquake humanitarian response" |
| Source-specific | "OCHA Turkey earthquake situation report" |
Typically 4-6 queries are generated per event. The number scales with event severity -- HIGH priority events generate more queries to ensure comprehensive coverage.
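As a rough illustration of the query types in the table, the sketch below builds one query per type from event metadata. The EventMeta fields, the generate_queries function, and the rule that only HIGH-priority events get the source-specific query are assumptions made for this example, not the orchestrator's actual logic.

```python
from dataclasses import dataclass


@dataclass
class EventMeta:
    # Hypothetical event shape; field names are assumptions for illustration.
    country: str
    locality: str
    hazard: str       # e.g. "earthquake"
    magnitude: float
    priority: str     # e.g. "HIGH"


def generate_queries(event: EventMeta) -> list[str]:
    """Build one search query per query type."""
    base = f"{event.country} {event.hazard}"
    queries = [
        f"{base} magnitude {event.magnitude}",      # primary
        f"{event.locality} {event.hazard} damage",  # location-specific
        f"{base} casualties displacement",          # impact-focused
        f"{base} humanitarian response",            # response-focused
    ]
    if event.priority == "HIGH":
        # Assumed scaling rule: higher priority adds a source-specific query.
        queries.append(f"OCHA {base} situation report")
    return queries


event = EventMeta("Turkey", "Kahramanmaras", "earthquake", 7.2, "HIGH")
queries = generate_queries(event)
```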
Step 2: Multi-Provider Search
Each query is sent through the backend's search layer (app/common/search_providers.py), which calls Serper.dev first and falls back to Brave Search only on error.
Serper.dev is the primary provider: it proxies Google's web and news indexes through a single authenticated API (~10 results/page, up to 10 pages). Brave Search is called only when Serper raises an error (missing key, network failure, upstream 5xx); a successful-but-empty Serper response does not trigger fallback. Both providers normalize their output into a common SearchResult shape, and the combined results are deduplicated by URL and returned as a ranked list. There is no SearXNG metasearch engine and no Tor proxy in the path.
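The error-triggered fallback can be sketched as follows. This is not the code in app/common/search_providers.py: the SearchResult fields, the Provider type, and the stub providers are hypothetical stand-ins used only to show the control flow.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class SearchResult:
    # Assumed minimal shape; the real SearchResult likely has more fields.
    url: str
    title: str


Provider = Callable[[str], list[SearchResult]]


def search_with_fallback(query: str, primary: Provider, fallback: Provider) -> list[SearchResult]:
    try:
        results = primary(query)    # an empty list counts as success
    except Exception:
        results = fallback(query)   # reached only when the primary raises
    seen: set[str] = set()
    unique: list[SearchResult] = []
    for result in results:          # keep first occurrence, preserve ranking
        if result.url not in seen:
            seen.add(result.url)
            unique.append(result)
    return unique


# Stub providers standing in for Serper.dev and Brave Search.
def failing_primary(query: str) -> list[SearchResult]:
    raise RuntimeError("upstream 5xx")


def empty_primary(query: str) -> list[SearchResult]:
    return []


def backup(query: str) -> list[SearchResult]:
    return [
        SearchResult("https://example.org/a", "A"),
        SearchResult("https://example.org/a", "A (duplicate)"),
        SearchResult("https://example.org/b", "B"),
    ]
```

With these stubs, the failing primary yields the two unique fallback results, while the empty primary yields an empty list without the fallback being called.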
Step 3: URL Deduplication
Before processing, discovered URLs are checked against the external_knowledge_bases table. URLs that have already been crawled (regardless of status) are skipped. This prevents redundant crawling across pipeline runs.
URLs are also deduplicated across the current batch -- if multiple search queries return the same URL, it is processed only once.
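Both checks reduce to set membership. In the sketch below, already_crawled stands in for the URLs already present in the external_knowledge_bases table; the function name and the in-memory set are assumptions for illustration.

```python
def filter_new_urls(discovered: list[str], already_crawled: set[str]) -> list[str]:
    """Drop URLs seen in earlier runs and repeats within the current batch."""
    seen = set(already_crawled)
    fresh: list[str] = []
    for url in discovered:
        if url not in seen:
            seen.add(url)       # also guards against in-batch duplicates
            fresh.append(url)
    return fresh


batch = [
    "https://example.org/report",
    "https://example.org/news",
    "https://example.org/report",   # returned by a second query
    "https://example.org/old",
]
new_urls = filter_new_urls(batch, {"https://example.org/old"})
```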
Step 4: Fan-Out to Workers
Deduplicated URLs are distributed to concurrent workers with a configurable concurrency limit (default: 6). Each worker handles a single URL independently.
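The actual pipeline runs as Inngest functions, so the asyncio version below only illustrates the idea of bounded concurrency. process_url is a hypothetical placeholder for the per-URL work.

```python
import asyncio


async def process_url(url: str) -> str:
    # Placeholder for crawling, validating, and indexing one URL.
    await asyncio.sleep(0.01)
    return url


async def fan_out(urls: list[str], limit: int = 6) -> list[str]:
    semaphore = asyncio.Semaphore(limit)

    async def worker(url: str) -> str:
        async with semaphore:   # at most `limit` workers are active at once
            return await process_url(url)

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(worker(u) for u in urls))


results = asyncio.run(fan_out([f"https://example.org/{i}" for i in range(20)]))
```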
Step 5: Crawl (Web or PDF)
Each URL follows one of two crawl paths depending on content type:
Web crawl path: The Crawler API manages async jobs with a 4-level strategy fallback chain. If the Basic strategy fails (for example on JavaScript-heavy pages or bot detection), the job escalates in order through the Patient (longer waits), Undetected (headless browser with Xvfb), and Proxy strategies.
PDF extraction path: Situation reports, official advisories, and academic papers are often published as PDFs. These are processed using Docling, which extracts structured text, tables, and metadata from PDF documents.
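The strategy escalation on the web crawl path can be sketched as a loop over an ordered list. The lowercase strategy identifiers and the fetch callback signature are assumptions for this example; the Crawler API's real interface is not shown in this document.

```python
from typing import Callable, Optional

# Order mirrors the fallback chain: Basic, Patient, Undetected, Proxy.
STRATEGIES = ["basic", "patient", "undetected", "proxy"]

Fetch = Callable[[str, str], Optional[str]]


def crawl_with_escalation(url: str, fetch: Fetch) -> tuple[Optional[str], Optional[str]]:
    """Return (strategy, text) for the first strategy that yields content."""
    for strategy in STRATEGIES:
        text = fetch(url, strategy)
        if text:
            return strategy, text
    return None, None   # every strategy failed


def stub_fetch(url: str, strategy: str) -> Optional[str]:
    # Simulates a page that only loads once the Undetected strategy is used.
    return "page text" if strategy in ("undetected", "proxy") else None


used, text = crawl_with_escalation("https://example.org/article", stub_fetch)
```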
Step 6: LLM Validation
Crawled content is passed to the News Filter Agent for relevance validation. The agent determines whether the content is:
- Relevant -- genuinely about this specific disaster event
- Partially relevant -- mentions the event but is primarily about something else
- Irrelevant -- not related to the event (false positive from search)
Only content classified as relevant proceeds to indexing. Partially relevant content is stored but not indexed. Irrelevant content is marked as such and excluded.
Step 7: Chunking
Validated content is split into chunks for embedding. The chunking strategy uses:
- Chunk size: 512 tokens (optimized for embedding model context window)
- Overlap: 64 tokens (ensures no information is lost at chunk boundaries)
- Splitting strategy: Paragraph-aware splitting that respects sentence boundaries
Each chunk retains metadata linking it back to its source URL and position within the original document.
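A simplified sketch of fixed-size chunking with overlap follows. It treats whitespace-separated words as tokens and ignores the paragraph- and sentence-aware splitting described above, so it illustrates only the size/overlap arithmetic and the per-chunk metadata. The dictionary keys are assumed names.

```python
def chunk_tokens(tokens: list[str], source_url: str,
                 size: int = 512, overlap: int = 64) -> list[dict]:
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap           # each chunk starts 448 tokens after the last
    chunks: list[dict] = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + size]
        chunks.append({
            "source_url": source_url,
            "position": len(chunks),    # chunk index within the document
            "start_token": start,
            "text": " ".join(window),
        })
        if start + size >= len(tokens):
            break                   # this window already reached the end
    return chunks


tokens = [f"t{i}" for i in range(1000)]
chunks = chunk_tokens(tokens, "https://example.org/report")
```

For a 1000-token document this produces three chunks starting at tokens 0, 448, and 896, with the last 64 tokens of each chunk repeated at the start of the next.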
Step 8: Embedding Generation
Each chunk is embedded using OpenAI's embedding model. The resulting vectors are stored in the disaster_event_embeddings table, which uses pgVector for efficient similarity search.
The pgVector table uses an HNSW (Hierarchical Navigable Small World) index for approximate nearest-neighbor search, enabling sub-millisecond similarity queries across millions of embeddings.
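For intuition, the exact search that an approximate index such as HNSW trades away for speed is a full scan ranked by similarity. The sketch below does that scan in plain Python, assuming cosine similarity as the metric; the metric actually configured for the disaster_event_embeddings table is not stated here.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], rows: list[tuple[str, list[float]]], k: int = 3) -> list[str]:
    """Exact nearest neighbours by brute force; an ANN index approximates this."""
    scored = [(cosine_similarity(query, vector), chunk_id) for chunk_id, vector in rows]
    return [chunk_id for _, chunk_id in sorted(scored, reverse=True)[:k]]


rows = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
nearest = top_k([1.0, 0.1], rows, k=2)
```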
The External Knowledge Base Model
Each discovered URL is tracked as an external_knowledge_base record:
| Field | Description |
|---|---|
| url | The discovered URL |
| disaster_event_id | Associated disaster event |
| crawl_status | Current pipeline status |
| content | Extracted full text |
| ai_summary | LLM-generated summary of the content |
| relevance_score | Confidence that content is relevant (0-1) |
| source_type | Classification (news, sitrep, official, academic) |
| discovered_at | When the URL was first found |
| crawled_at | When content was extracted |
| word_count | Length of extracted content |
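The fields above map naturally onto a record type. The dataclass below is a sketch only: the field names come from the table, but every type annotation, and the choice of which fields are optional, is an assumption rather than the actual schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class ExternalKnowledgeBase:
    url: str
    disaster_event_id: str                  # assumed type; could be an int or UUID
    crawl_status: str                       # status values are not listed here
    discovered_at: datetime
    content: Optional[str] = None           # filled in after a successful crawl
    ai_summary: Optional[str] = None
    relevance_score: Optional[float] = None # 0.0 to 1.0 once validated
    source_type: Optional[str] = None       # news, sitrep, official, academic
    crawled_at: Optional[datetime] = None
    word_count: Optional[int] = None


record = ExternalKnowledgeBase(
    url="https://example.org/sitrep-1",
    disaster_event_id="event-123",
    crawl_status="discovered",              # hypothetical status value
    discovered_at=datetime(2023, 2, 6, 4, 30),
)
```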
Relevance Scoring
Each knowledge base entry receives a relevance score from 0.0 to 1.0, computed by the News Filter Agent. The score considers:
- Topical match -- Does the content discuss this specific event?
- Recency -- Is the content from the event's active period?
- Source authority -- Is it from a recognized news or humanitarian source?
- Information density -- Does it contain substantive facts vs. generic coverage?
Content scoring above 0.7 is automatically indexed. Content between 0.4 and 0.7 is indexed but flagged for review. Content below 0.4 is rejected.
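The thresholds translate into a small triage function. The text does not say how scores of exactly 0.4 or 0.7 are handled, so this sketch assumes both fall into the flagged band; the function name and return labels are also illustrative.

```python
def triage(score: float) -> str:
    """Map a relevance score to an indexing decision."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("relevance score must be between 0.0 and 1.0")
    if score > 0.7:
        return "index"
    if score >= 0.4:            # assumption: 0.4 and 0.7 are both flagged
        return "index_flagged"
    return "reject"


decisions = [triage(s) for s in (0.92, 0.55, 0.1)]
```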
AI Summary Generation
For each validated knowledge base entry, an AI-generated summary is produced and stored in the ai_summary field. These summaries are:
- 2-3 sentences capturing the key information
- Written in factual, neutral tone
- Used for display in the frontend knowledge base view
- Distinct from the full content used for embedding
Always-Crawl URLs
Certain URLs are configured to be crawled regardless of search discovery. These are authoritative sources that are always relevant when an event of a specific type occurs:
| Event Type | Always-Crawl Sources |
|---|---|
| Earthquake | USGS event page, EMSC page, GDACS report page |
| Tropical Cyclone | NHC advisory page, JTWC warning, GDACS report page |
| Flood | GDACS report page, relevant national flood authority |
| Volcano | Smithsonian GVP page, GDACS report page |
These URLs are generated from templates using the event's source IDs and injected into the pipeline before the search step.
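Template expansion of this kind might look like the sketch below. The example.org URLs and the source ID keys (usgs_id, emsc_id, gdacs_id) are placeholders invented for illustration; they are not the real templates or the real URL patterns of those agencies.

```python
# Hypothetical templates keyed by event type; real templates are not shown here.
ALWAYS_CRAWL_TEMPLATES: dict[str, list[str]] = {
    "earthquake": [
        "https://example.org/usgs/{usgs_id}",
        "https://example.org/emsc/{emsc_id}",
        "https://example.org/gdacs/{gdacs_id}",
    ],
    "volcano": [
        "https://example.org/gvp/{gvp_id}",
        "https://example.org/gdacs/{gdacs_id}",
    ],
}


def always_crawl_urls(event_type: str, source_ids: dict[str, str]) -> list[str]:
    urls: list[str] = []
    for template in ALWAYS_CRAWL_TEMPLATES.get(event_type, []):
        try:
            urls.append(template.format(**source_ids))
        except KeyError:
            continue    # skip sources for which the event has no ID
    return urls


seed_urls = always_crawl_urls("earthquake", {"usgs_id": "abc123", "gdacs_id": "42"})
```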
Scheduling
The knowledge discovery pipeline runs on a defined schedule tied to event age.
Early runs (T+0 through T+6h) focus on breaking news and initial situation reports. Later runs (T+12h through T+48h) target comprehensive analyses, official assessments, and updated casualty figures.
For events that remain active beyond 48 hours, additional runs can be triggered manually or by the notification engine's scheduled follow-ups.
News and Media Sync
A separate scheduled job synchronizes news coverage for active HIGH-priority events:
- Runs every 4 hours for events less than 3 days old
- Runs every 12 hours for events between 3 and 7 days old
- Stops for events older than 7 days (unless manually re-enabled)
This ensures that the knowledge base stays current as the situation evolves, capturing new developments, updated figures, and response progress.
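The sync cadence can be expressed as a function of event age. This sketch assumes that an event exactly 3 or 7 days old falls in the 12-hour band, and it leaves out manual re-enabling, since the interval used in that case is not specified.

```python
from typing import Optional


def sync_interval_hours(age_days: float, priority: str = "HIGH") -> Optional[int]:
    """Return the news sync interval in hours, or None if sync does not run."""
    if priority != "HIGH":
        return None     # the job covers only HIGH-priority events
    if age_days < 3:
        return 4
    if age_days <= 7:
        return 12
    return None         # older than 7 days: sync stops


intervals = [sync_interval_hours(age) for age in (1, 3, 7, 8)]
```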
Pipeline Metrics
The knowledge discovery pipeline tracks:
| Metric | Description |
|---|---|
| URLs discovered | Total URLs found across all search queries |
| URLs deduplicated | URLs skipped (already in knowledge base) |
| Crawl success rate | Percentage of URLs successfully extracted |
| Validation pass rate | Percentage of crawled content classified as relevant |
| Embedding count | Total vector embeddings generated per event |
| End-to-end latency | Time from trigger to all embeddings stored |
Typical pipeline performance for a HIGH-priority earthquake:
- 60-80 URLs discovered across 6 queries
- 35-50 unique URLs after dedup
- 25-40 successfully crawled
- 15-30 validated as relevant
- 200-400 embedding chunks generated
- Total pipeline time: 3-5 minutes
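The two rate metrics follow directly from the stage counts. The sketch below assumes the crawl success rate is measured against unique URLs after deduplication and the validation pass rate against successfully crawled URLs; the sample numbers are illustrative values within the ranges above.

```python
def pipeline_rates(unique_urls: int, crawled: int, relevant: int) -> dict[str, float]:
    """Compute crawl success and validation pass rates as fractions."""
    return {
        "crawl_success_rate": crawled / unique_urls if unique_urls else 0.0,
        "validation_pass_rate": relevant / crawled if crawled else 0.0,
    }


rates = pipeline_rates(unique_urls=40, crawled=30, relevant=21)
```

With these sample counts the crawl success rate is 75% and the validation pass rate is 70%.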