TerraGuard

Knowledge Discovery Pipeline

How TerraGuard automatically discovers, crawls, validates, and indexes external knowledge sources to build a comprehensive information base for each disaster event.

Overview

The Knowledge Discovery Pipeline is responsible for building a rich, up-to-date knowledge base for every disaster event. It automatically searches the web for relevant content, crawls and extracts text from discovered pages, validates relevance using an LLM, and indexes the content as vector embeddings for semantic retrieval.

This pipeline runs as a set of Inngest functions triggered after event ingestion. It then re-runs on a schedule to discover new content as events develop over time.


Pipeline Steps

Step 1: Query Generation

The orchestrator generates multiple search queries from the event metadata. Rather than searching for a single generic query, it produces targeted queries that cover different aspects of the event:

| Query Type | Example (M7.2 Earthquake, Turkey) |
|---|---|
| Primary | "Turkey earthquake magnitude 7.2" |
| Location-specific | "Kahramanmaras earthquake damage" |
| Impact-focused | "Turkey earthquake casualties displacement" |
| Response-focused | "Turkey earthquake humanitarian response" |
| Source-specific | "OCHA Turkey earthquake situation report" |

Typically 4-6 queries are generated per event. The number scales with event severity -- HIGH priority events generate more queries to ensure comprehensive coverage.
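A minimal sketch of how such queries could be derived from event metadata. The `EventMeta` fields, the string templates, and the rule that only HIGH-priority events receive the source-specific query are assumptions for illustration, not TerraGuard's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class EventMeta:
    """Hypothetical subset of event metadata used for query generation."""
    country: str
    locality: str
    hazard: str      # e.g. "earthquake"
    magnitude: str   # e.g. "7.2"
    priority: str    # e.g. "HIGH"


def generate_queries(event: EventMeta) -> list[str]:
    # One query per aspect, mirroring the query types listed above.
    queries = [
        f"{event.country} {event.hazard} magnitude {event.magnitude}",  # primary
        f"{event.locality} {event.hazard} damage",                      # location-specific
        f"{event.country} {event.hazard} casualties displacement",      # impact-focused
        f"{event.country} {event.hazard} humanitarian response",        # response-focused
    ]
    # Assumption: severity scaling is modelled as one extra query for HIGH priority.
    if event.priority == "HIGH":
        queries.append(f"OCHA {event.country} {event.hazard} situation report")
    return queries


event = EventMeta("Turkey", "Kahramanmaras", "earthquake", "7.2", "HIGH")
queries = generate_queries(event)
```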

Step 2: Web Search

Each query is sent through the backend's search layer (app/common/search_providers.py), which calls Serper.dev first and falls back to Brave Search only on error.

Serper.dev is the primary provider. It proxies Google's web and news indexes through a single authenticated API (about 10 results per page, up to 10 pages). Brave Search is called only when Serper raises an error (missing key, network failure, upstream 5xx); a successful but empty Serper response does not trigger the fallback. Both providers normalize their results into a common SearchResult shape, which is deduplicated by URL and returned as a ranked list. There is no SearXNG metasearch engine and no Tor proxy in the path.
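The error-only fallback rule can be sketched as follows. This is an illustration under assumptions: the `SearchResult` fields, the `Provider` signature, and the stub providers are hypothetical and do not reflect the real contents of search_providers.py.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class SearchResult:
    url: str
    title: str


# Hypothetical provider signature: takes a query, returns results or raises.
Provider = Callable[[str], list[SearchResult]]


def search_with_fallback(query: str, primary: Provider, fallback: Provider) -> list[SearchResult]:
    try:
        results = primary(query)    # an empty list is a valid answer, no fallback
    except Exception:
        results = fallback(query)   # only reached when the primary raises
    # Deduplicate by URL while keeping the provider's ranking order.
    seen: set[str] = set()
    unique: list[SearchResult] = []
    for result in results:
        if result.url not in seen:
            seen.add(result.url)
            unique.append(result)
    return unique


# Stub providers standing in for Serper.dev and Brave Search.
def failing_primary(query: str) -> list[SearchResult]:
    raise RuntimeError("upstream 5xx")


def empty_primary(query: str) -> list[SearchResult]:
    return []


def stub_fallback(query: str) -> list[SearchResult]:
    return [
        SearchResult("https://example.org/a", "A"),
        SearchResult("https://example.org/a", "A (duplicate)"),
    ]
```

With these stubs, a raising primary yields the fallback's single unique result, while an empty primary yields an empty list without the fallback ever being called.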

Step 3: URL Deduplication

Before processing, discovered URLs are checked against the external_knowledge_bases table. URLs that have already been crawled (regardless of status) are skipped. This prevents redundant crawling across pipeline runs.

URLs are also deduplicated across the current batch -- if multiple search queries return the same URL, it is processed only once.
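Both deduplication rules can be expressed in one pass. In this sketch, `already_known` is a hypothetical in-memory stand-in for a lookup against the external_knowledge_bases table.

```python
def select_new_urls(discovered: list[str], already_known: set[str]) -> list[str]:
    """Drop URLs already recorded and repeats within the batch, keeping first-seen order."""
    seen = set(already_known)  # copy so the caller's set is not mutated
    fresh: list[str] = []
    for url in discovered:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh


batch = [
    "https://example.org/a",
    "https://example.org/b",
    "https://example.org/a",  # returned by a second query
    "https://example.org/c",
]
new_urls = select_new_urls(batch, already_known={"https://example.org/c"})
```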

Step 4: Fan-Out to Workers

Deduplicated URLs are distributed to concurrent workers with a configurable concurrency limit (default: 6). Each worker handles a single URL independently.
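As a rough illustration of bounded fan-out, the sketch below uses plain asyncio with a semaphore. It is not how Inngest enforces concurrency, and `process_url` is a hypothetical placeholder for the per-URL work.

```python
import asyncio


async def process_url(url: str) -> str:
    # Placeholder for crawling, validating, chunking, and embedding one URL.
    await asyncio.sleep(0)
    return f"done:{url}"


async def fan_out(urls: list[str], limit: int = 6) -> list[str]:
    semaphore = asyncio.Semaphore(limit)  # at most `limit` workers run at once

    async def worker(url: str) -> str:
        async with semaphore:
            return await process_url(url)

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(worker(u) for u in urls))


results = asyncio.run(fan_out([f"https://example.org/{i}" for i in range(10)]))
```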


Step 5: Crawl (Web or PDF)

Each URL follows one of two crawl paths depending on content type:


Web crawl path: The Crawler API manages async jobs with a four-level strategy fallback chain. If the Basic strategy fails (for example on JavaScript-heavy pages or bot detection), it escalates in order through Patient (longer waits), Undetected (headless browser with Xvfb), and Proxy.

PDF extraction path: Situation reports, official advisories, and academic papers are often published as PDFs. These are processed using Docling, which extracts structured text, tables, and metadata from PDF documents.
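The routing and escalation logic might look roughly like this. The `is_pdf` heuristic and the `fetch` callback signature are assumptions; the real Crawler API and Docling calls are not shown.

```python
from typing import Callable, Optional

# Strategy order as described above.
STRATEGIES = ["basic", "patient", "undetected", "proxy"]


def is_pdf(url: str, content_type: str = "") -> bool:
    # Assumption: route on Content-Type when available, else on the file extension.
    path = url.lower().split("?")[0]
    return "application/pdf" in content_type.lower() or path.endswith(".pdf")


def crawl_with_escalation(
    url: str, fetch: Callable[[str, str], Optional[str]]
) -> tuple[Optional[str], Optional[str]]:
    """Try each strategy in order; return (text, strategy) or (None, None)."""
    for strategy in STRATEGIES:
        text = fetch(url, strategy)
        if text:
            return text, strategy
    return None, None


def stub_fetch(url: str, strategy: str) -> Optional[str]:
    # Pretend the page only renders with the headless-browser strategy.
    return "page text" if strategy in ("undetected", "proxy") else None


outcome = crawl_with_escalation("https://example.org/news", stub_fetch)
```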

Step 6: LLM Validation

Crawled content is passed to the News Filter Agent for relevance validation. The agent determines whether the content is:

  • Relevant -- genuinely about this specific disaster event
  • Partially relevant -- mentions the event but is primarily about something else
  • Irrelevant -- not related to the event (false positive from search)

Only content classified as relevant proceeds to indexing. Partially relevant content is stored but not indexed. Irrelevant content is marked as such and excluded.

Step 7: Chunking

Validated content is split into chunks for embedding. The chunking strategy uses:

  • Chunk size: 512 tokens (optimized for embedding model context window)
  • Overlap: 64 tokens (ensures no information is lost at chunk boundaries)
  • Splitting strategy: Paragraph-aware splitting that respects sentence boundaries

Each chunk retains metadata linking it back to its source URL and position within the original document.
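The size and overlap parameters can be illustrated with a plain sliding window. This sketch is simplified: it treats a pre-tokenized list as input and does not implement the paragraph-aware splitting described above. The chunk dictionary layout is hypothetical.

```python
def chunk_tokens(
    tokens: list[str], source_url: str, size: int = 512, overlap: int = 64
) -> list[dict]:
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # each window starts 448 tokens after the previous one
    chunks: list[dict] = []
    for start in range(0, len(tokens), step):
        chunks.append({
            "source_url": source_url,  # link back to the source document
            "start": start,            # token offset within the document
            "tokens": tokens[start:start + size],
        })
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


tokens = [str(i) for i in range(1000)]
chunks = chunk_tokens(tokens, "https://example.org/sitrep")
```

For a 1000-token document this produces windows starting at offsets 0, 448, and 896, with the last 64 tokens of each window repeated at the start of the next.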

Step 8: Embedding Generation

Each chunk is embedded using OpenAI's embedding model. The resulting vectors are stored in the disaster_event_embeddings table, which uses pgVector for efficient similarity search.


The pgVector table uses an HNSW (Hierarchical Navigable Small World) index for approximate nearest-neighbor search, enabling sub-millisecond similarity queries across millions of embeddings.
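For intuition, the sketch below performs the exact, brute-force cosine-similarity ranking that an HNSW index approximates. It is a pure-Python illustration with made-up two-dimensional vectors, not a pgVector query; real embeddings have far more dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], stored: dict[str, list[float]], k: int = 2) -> list[tuple[str, float]]:
    # Score every stored vector, then keep the k most similar.
    scored = [(chunk_id, cosine_similarity(query, vec)) for chunk_id, vec in stored.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


stored = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
nearest = top_k([1.0, 0.0], stored)
```

The exact scan costs time proportional to the number of stored vectors, which is the cost an approximate index is meant to avoid.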

The External Knowledge Base Model

Each discovered URL is tracked as an external_knowledge_base record:

| Field | Description |
|---|---|
| url | The discovered URL |
| disaster_event_id | Associated disaster event |
| crawl_status | Current pipeline status |
| content | Extracted full text |
| ai_summary | LLM-generated summary of the content |
| relevance_score | Confidence that content is relevant (0-1) |
| source_type | Classification (news, sitrep, official, academic) |
| discovered_at | When the URL was first found |
| crawled_at | When content was extracted |
| word_count | Length of extracted content |
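As an in-memory illustration, the record could be modelled as a dataclass. The field names come from the table above; the Python types, the defaults, and the "pending" status value are assumptions, since the actual schema and status values are not given here.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ExternalKnowledgeBase:
    url: str
    disaster_event_id: str                  # assumed to be a string identifier
    crawl_status: str = "pending"           # hypothetical initial status
    content: Optional[str] = None
    ai_summary: Optional[str] = None
    relevance_score: Optional[float] = None  # 0.0 to 1.0 once validated
    source_type: Optional[str] = None        # news, sitrep, official, academic
    discovered_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    crawled_at: Optional[datetime] = None
    word_count: Optional[int] = None


record = ExternalKnowledgeBase(url="https://example.org/sitrep", disaster_event_id="evt-1")
```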

Relevance Scoring

Each knowledge base entry receives a relevance score from 0.0 to 1.0, computed by the News Filter Agent. The score considers:

  • Topical match -- Does the content discuss this specific event?
  • Recency -- Is the content from the event's active period?
  • Source authority -- Is it from a recognized news or humanitarian source?
  • Information density -- Does it contain substantive facts vs. generic coverage?

Content scoring above 0.7 is automatically indexed. Content between 0.4 and 0.7 is indexed but flagged for review. Content below 0.4 is rejected.
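These thresholds translate into a small decision function. The treatment of the exact boundary values (0.4 and 0.7 both fall in the flagged band) and the returned labels are assumptions made for this sketch.

```python
def indexing_decision(score: float) -> str:
    if not 0.0 <= score <= 1.0:
        raise ValueError("relevance score must be between 0.0 and 1.0")
    if score > 0.7:
        return "index"
    if score >= 0.4:
        return "index_flagged"  # indexed, but flagged for review
    return "reject"
```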

AI Summary Generation

For each validated knowledge base entry, an AI-generated summary is produced and stored in the ai_summary field. These summaries are:

  • 2-3 sentences capturing the key information
  • Written in factual, neutral tone
  • Used for display in the frontend knowledge base view
  • Distinct from the full content used for embedding

Always-Crawl URLs

Certain URLs are configured to be crawled regardless of search discovery. These are authoritative sources that are always relevant when an event of a specific type occurs:

| Event Type | Always-Crawl Sources |
|---|---|
| Earthquake | USGS event page, EMSC page, GDACS report page |
| Tropical Cyclone | NHC advisory page, JTWC warning, GDACS report page |
| Flood | GDACS report page, relevant national flood authority |
| Volcano | Smithsonian GVP page, GDACS report page |

These URLs are generated from templates using the event's source IDs and injected into the pipeline before the search step.
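Template expansion could work along these lines. The real URL patterns are not given in this document, so the templates below use placeholder example.org hosts, and the source ID key names (`usgs_id`, `gdacs_id`, `gvp_id`) are hypothetical.

```python
# Placeholder templates; real source URL patterns are not shown in this document.
ALWAYS_CRAWL_TEMPLATES: dict[str, list[str]] = {
    "earthquake": [
        "https://usgs.example.org/event/{usgs_id}",
        "https://gdacs.example.org/report/{gdacs_id}",
    ],
    "volcano": [
        "https://gvp.example.org/volcano/{gvp_id}",
        "https://gdacs.example.org/report/{gdacs_id}",
    ],
}


def always_crawl_urls(event_type: str, source_ids: dict[str, str]) -> list[str]:
    urls: list[str] = []
    for template in ALWAYS_CRAWL_TEMPLATES.get(event_type, []):
        try:
            urls.append(template.format(**source_ids))
        except KeyError:
            continue  # skip templates whose source ID is missing for this event
    return urls


seed_urls = always_crawl_urls("earthquake", {"usgs_id": "abc123"})
```

Skipping templates with missing IDs means an event reported by only one source still gets that source's page without producing malformed URLs.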

Scheduling

The knowledge discovery pipeline runs on a defined schedule tied to event age.

Early runs (T+0 through T+6h) focus on breaking news and initial situation reports. Later runs (T+12h through T+48h) target comprehensive analyses, official assessments, and updated casualty figures.

For events that remain active beyond 48 hours, additional runs can be triggered manually or by the notification engine's scheduled follow-ups.

News and Media Sync

A separate scheduled job synchronizes news coverage for active HIGH-priority events:

  • Runs every 4 hours for events less than 3 days old
  • Runs every 12 hours for events between 3 and 7 days old
  • Stops for events older than 7 days (unless manually re-enabled)

This ensures that the knowledge base stays current as the situation evolves, capturing new developments, updated figures, and response progress.
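The age-based cadence above can be written as a small lookup. The function name is hypothetical, and treating exactly 3 days as the start of the 12-hour band and exactly 7 days as still active are boundary assumptions.

```python
from datetime import timedelta
from typing import Optional


def news_sync_interval(event_age: timedelta) -> Optional[timedelta]:
    """Return the sync interval for an event of this age, or None once syncing stops."""
    if event_age < timedelta(days=3):
        return timedelta(hours=4)
    if event_age <= timedelta(days=7):
        return timedelta(hours=12)
    return None  # older than 7 days: no automatic sync
```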

Screenshot: Knowledge base view for a disaster event showing indexed articles with relevance scores, AI summaries, source types, and crawl timestamps

Pipeline Metrics

The knowledge discovery pipeline tracks:

| Metric | Description |
|---|---|
| URLs discovered | Total URLs found across all search queries |
| URLs deduplicated | URLs skipped (already in knowledge base) |
| Crawl success rate | Percentage of URLs successfully extracted |
| Validation pass rate | Percentage of crawled content classified as relevant |
| Embedding count | Total vector embeddings generated per event |
| End-to-end latency | Time from trigger to all embeddings stored |

Typical pipeline performance for a HIGH-priority earthquake:

  • 60-80 URLs discovered across 6 queries
  • 35-50 unique URLs after dedup
  • 25-40 successfully crawled
  • 15-30 validated as relevant
  • 200-400 embedding chunks generated
  • Total pipeline time: 3-5 minutes
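The two rate metrics can be derived from the funnel counts. The choice of denominators here (unique URLs for crawl success, crawled URLs for validation pass) is an assumption, and the counts in the example are illustrative values picked from within the typical ranges above, not measured results.

```python
def funnel_rates(unique: int, crawled: int, relevant: int) -> dict[str, float]:
    """Compute crawl success and validation pass rates as fractions."""
    return {
        # Assumption: success is measured against URLs remaining after dedup.
        "crawl_success_rate": crawled / unique if unique else 0.0,
        # Assumption: pass rate is measured against successfully crawled URLs.
        "validation_pass_rate": relevant / crawled if crawled else 0.0,
    }


rates = funnel_rates(unique=50, crawled=40, relevant=30)
```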
