TerraGuard

Knowledge Base & AI Summaries

How TerraGuard discovers, validates, crawls, and indexes external knowledge for each disaster event, enabling AI summaries and RAG-powered question answering.

Overview

The knowledge base is the foundation of TerraGuard's AI capabilities. For every disaster event, the platform automatically discovers relevant web content (news articles, situation reports, government advisories, and humanitarian updates), validates it, crawls it, and indexes it for semantic retrieval.

This indexed knowledge powers two key features: AI-generated event summaries and the RAG (Retrieval-Augmented Generation) chat interface.

Knowledge Discovery Pipeline

The discovery pipeline runs automatically when a new event is ingested or updated. It consists of four stages: web search, LLM validation, content crawling, and vector indexing.

Stage 1: Web Search

The backend's search layer runs web and news searches with terms derived from the event data — event type, location, magnitude, and date. Multiple search queries are generated to maximize coverage:

  • Primary query: event-specific terms (e.g., "earthquake Papua New Guinea magnitude 6.8")
  • Secondary queries: broader terms, affected region names, humanitarian response terms
  • Source-specific queries: targeting known humanitarian information sources
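
The three query tiers above can be sketched as follows. This is a minimal illustration, not TerraGuard's actual query builder: the Event class, the build_queries function, and the source names "ReliefWeb" and "OCHA" are all assumptions chosen for the example, and the date term mentioned above is omitted for brevity.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Event:
    # Hypothetical event shape; the real event model is not shown in this page.
    event_type: str
    location: str
    magnitude: Optional[float] = None


def build_queries(event: Event, sources: Sequence[str] = ("ReliefWeb", "OCHA")) -> List[str]:
    """Return primary, secondary, and source-specific queries without duplicates."""
    # Primary query: event-specific terms.
    primary = f"{event.event_type} {event.location}"
    if event.magnitude is not None:
        primary += f" magnitude {event.magnitude}"

    queries = [primary]
    # Secondary queries: broader, response-oriented phrasing.
    queries.append(f"{event.location} {event.event_type} humanitarian response")
    queries.append(f"{event.location} {event.event_type} situation report")
    # Source-specific queries: prefix a source name (illustrative names only).
    queries.extend(f"{source} {event.location} {event.event_type}" for source in sources)

    # dict.fromkeys keeps first-seen order while dropping exact duplicates.
    return list(dict.fromkeys(queries))


for query in build_queries(Event("earthquake", "Papua New Guinea", 6.8)):
    print(query)
```

Generating several narrow queries rather than one broad query trades extra search calls for better coverage, which is why the results need deduplication afterwards.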

Search is an internal Backend API module (app/common/search_providers.py), not a separate service. It calls Serper.dev (Google web + news) as the primary provider and falls back to Brave Search only when the Serper request fails. Results from either provider are normalized into a common shape, deduplicated by URL, and returned as a ranked list with titles and snippets.
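
The fallback and deduplication behavior described above might look roughly like the sketch below. The providers are passed in as plain callables so that no real Serper or Brave API call is implied; the function names, the result keys, and the trailing-slash normalization are assumptions for illustration and are not taken from search_providers.py.

```python
from typing import Callable, Dict, List

SearchResult = Dict[str, str]  # assumed common shape: "url", "title", "snippet"
Provider = Callable[[str], List[SearchResult]]


def search_with_fallback(query: str, primary: Provider, fallback: Provider) -> List[SearchResult]:
    """Use the primary provider; call the fallback only if the primary raises."""
    try:
        return primary(query)
    except Exception:
        return fallback(query)


def dedupe_by_url(results: List[SearchResult]) -> List[SearchResult]:
    """Keep the first (highest-ranked) result for each URL, preserving order."""
    seen = set()
    unique = []
    for result in results:
        key = result["url"].rstrip("/")  # assumed normalization: ignore a trailing slash
        if key not in seen:
            seen.add(key)
            unique.append(result)
    return unique


# Stub providers standing in for the real services.
def failing_primary(query: str) -> List[SearchResult]:
    raise RuntimeError("primary provider unavailable")


def stub_fallback(query: str) -> List[SearchResult]:
    return [
        {"url": "https://example.org/a", "title": "A", "snippet": "first"},
        {"url": "https://example.org/a/", "title": "A again", "snippet": "duplicate"},
        {"url": "https://example.org/b", "title": "B", "snippet": "second"},
    ]


results = dedupe_by_url(search_with_fallback("earthquake", failing_primary, stub_fallback))
print([r["url"] for r in results])
```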

Stage 2: LLM Validation

Not every search result is relevant. The news filter agent — an LLM-based classifier — evaluates each result to determine whether it genuinely relates to the disaster event. This prevents noise from entering the knowledge base.

The agent considers:

  • Does the article discuss the same event (matching location, date, type)?
  • Is the content substantive or is it a stub/duplicate?
  • Is the source credible?

Results that pass validation proceed to crawling. Typically 40-70% of search results are filtered out at this stage.
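
A validation step of this kind could be structured as in the sketch below. The prompt wording, the JSON reply format, and the ask_llm callable are assumptions made for the example; the real news filter agent's prompt and output schema are not documented here. Unparseable replies are treated as "not relevant" so that a malformed LLM response cannot let noise through.

```python
import json
from typing import Callable, Dict, List


def build_validation_prompt(event_description: str, title: str, snippet: str) -> str:
    """Assemble a classification prompt covering the criteria listed above."""
    return (
        "Decide whether the article below covers this disaster event. "
        "Consider location, date, and type, whether the content is substantive, "
        "and whether the source is credible.\n"
        f"Event: {event_description}\n"
        f"Title: {title}\n"
        f"Snippet: {snippet}\n"
        'Reply with JSON only: {"relevant": true} or {"relevant": false}.'
    )


def is_relevant(raw_reply: str) -> bool:
    """Parse the model reply; anything other than {"relevant": true} is rejected."""
    try:
        verdict = json.loads(raw_reply)
    except json.JSONDecodeError:
        return False
    return isinstance(verdict, dict) and verdict.get("relevant") is True


def filter_results(
    event_description: str,
    results: List[Dict[str, str]],
    ask_llm: Callable[[str], str],  # hypothetical wrapper around the LLM call
) -> List[Dict[str, str]]:
    """Return only the results the classifier marks as relevant."""
    return [
        r
        for r in results
        if is_relevant(ask_llm(build_validation_prompt(event_description, r["title"], r["snippet"])))
    ]
```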

Stage 3: Content Crawling

Validated URLs are sent to the Web Crawler API for full content extraction. The crawler uses a 4-level strategy fallback chain to maximize extraction success:

  1. Basic — Standard HTTP fetch with content extraction
  2. Patient — Extended timeouts and wait-for-content strategies
  3. Undetected — Browser-based crawling with Xvfb for JavaScript-rendered pages
  4. Proxy — Proxy-routed requests for geo-restricted or bot-protected content

The crawler returns clean, structured text content stripped of navigation, ads, and boilerplate.
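
The fallback chain can be pictured as trying each strategy in order and stopping at the first one that yields usable text. The sketch below assumes each strategy is a callable that returns extracted text or raises; the crawl_with_fallback name and the min_chars threshold are illustrative assumptions, not the Web Crawler API's actual interface.

```python
from typing import Callable, Optional, Sequence, Tuple

# A strategy is a (name, fetch) pair; fetch returns extracted text or raises.
Strategy = Tuple[str, Callable[[str], Optional[str]]]


def crawl_with_fallback(
    url: str,
    strategies: Sequence[Strategy],
    min_chars: int = 200,  # assumed threshold for "enough content was extracted"
) -> Optional[Tuple[str, str]]:
    """Try strategies in order; return (strategy_name, text) for the first success."""
    for name, fetch in strategies:
        try:
            text = fetch(url)
        except Exception:
            continue  # this level failed, escalate to the next one
        if text and len(text.strip()) >= min_chars:
            return name, text
    return None  # every level failed or returned too little content
```

In use, the strategies list would be ordered Basic, Patient, Undetected, Proxy, so that the cheapest approach is always attempted first and the more expensive ones run only when needed.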

Stage 4: Vector Indexing

Crawled content is split into chunks, embedded using OpenAI's embedding model, and stored in PostgreSQL with the pgVector extension. Each chunk maintains a reference to its source URL, event ID, and metadata.

This enables semantic search — instead of keyword matching, the system finds content based on meaning, which is critical for the RAG features.
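
As a rough illustration of the chunking step, the sketch below splits text into overlapping character windows and attaches the source URL and event ID to each chunk. The chunk size, the overlap, the character-based splitting, and the Chunk record are all assumptions; the page does not state how TerraGuard actually sizes or splits chunks. The embedding call and the database write are left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    # Hypothetical record mirroring the references each chunk keeps.
    event_id: str
    source_url: str
    index: int
    text: str


def chunk_text(text: str, size: int = 800, overlap: int = 100) -> List[str]:
    """Split text into windows of `size` characters that overlap by `overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    if not text:
        return []
    step = size - overlap
    # Stop once the remaining tail is already covered by the previous window.
    last_start = max(len(text) - overlap, 1)
    return [text[start:start + size] for start in range(0, last_start, step)]


def make_chunks(event_id: str, source_url: str, text: str,
                size: int = 800, overlap: int = 100) -> List[Chunk]:
    """Wrap each text window in a Chunk that remembers where it came from."""
    return [
        Chunk(event_id, source_url, i, piece)
        for i, piece in enumerate(chunk_text(text, size, overlap))
    ]


print(chunk_text("abcdefghij", size=4, overlap=1))  # ['abcd', 'defg', 'ghij']
```

Overlap keeps a sentence that straddles a boundary retrievable from at least one chunk, at the cost of storing some text twice.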

Knowledge Base Interface

Screenshot: Knowledge base list view showing indexed articles for a specific event, with titles, sources, dates, content types, and relevance indicators

The knowledge base tab on the event detail page displays all indexed content for that event. Each entry shows:

Field          Description
Title          Article or document headline
Source         Publication or organization name
URL            Link to the original content
Type           News article, SITREP, advisory, report
Indexed Date   When the content was crawled and indexed
Chunks         Number of vector chunks created

You can search within the knowledge base using the search bar, which performs semantic search over the indexed content.

AI Summary Generation

AI summaries synthesize all available knowledge about an event into a concise, structured overview.

Screenshot: AI summary generation interface showing a "Generate Summary" button, a loading state with progress indicator, and the resulting summary with sections for situation overview, impact, and response

How Summaries Work

When you request a summary, the system:

  1. Retrieves the most relevant knowledge base chunks using semantic search
  2. Includes event measurements, affected countries, and timeline data
  3. Sends the assembled context to the AI model with structured prompting
  4. Returns a multi-section summary covering:
    • Situation Overview — What happened, where, and when
    • Impact Assessment — Affected population, infrastructure damage, casualties
    • Response Status — Humanitarian actions, government declarations, aid mobilization
    • Key Developments — Recent changes and emerging concerns
    • Outlook — Expected trajectory and anticipated needs
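
Steps 2 to 4 amount to assembling a structured prompt from event data and retrieved chunks. The sketch below shows one plausible shape; the prompt wording and the build_summary_prompt function are assumptions, and only the five section names come from the list above.

```python
from typing import Dict, List

SUMMARY_SECTIONS = [
    "Situation Overview",
    "Impact Assessment",
    "Response Status",
    "Key Developments",
    "Outlook",
]


def build_summary_prompt(event_facts: Dict[str, str], chunks: List[str]) -> str:
    """Combine structured event data and retrieved excerpts into one prompt."""
    facts = "\n".join(f"- {key}: {value}" for key, value in event_facts.items())
    excerpts = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, 1))
    headings = "\n".join(f"{i}. {name}" for i, name in enumerate(SUMMARY_SECTIONS, 1))
    return (
        "Summarize the disaster event using only the facts and excerpts provided.\n\n"
        f"Event data:\n{facts}\n\n"
        f"Knowledge base excerpts:\n{excerpts}\n\n"
        f"Write one section for each heading:\n{headings}"
    )
```

Listing the headings explicitly in the prompt is what keeps the output structure stable from one regeneration to the next.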

Regenerating Summaries

Summaries can be regenerated at any time. This is useful when:

  • New articles have been indexed since the last summary
  • The event situation has evolved significantly
  • You want to incorporate a specific angle or focus

Each regeneration uses the latest available knowledge base content.

RAG Chat

The RAG chat interface allows you to ask free-form questions about any event. Unlike a general-purpose chatbot, answers are grounded in the indexed knowledge base for that specific event.

Screenshot: RAG chat interface showing a conversation thread with user questions and AI responses, with source citations linking to original articles

How RAG Chat Works

  1. Your question is embedded using the same model used for knowledge base indexing
  2. Semantic search finds the most relevant chunks from the event's knowledge base
  3. These chunks are assembled into a context window with the question
  4. The LLM generates an answer grounded in the retrieved context
  5. Source citations are included so you can verify claims against original documents
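
Steps 2 and 3 can be sketched in memory as below, using cosine similarity over toy vectors. In the real system the question embedding comes from the embedding model and the similarity search runs in PostgreSQL with pgVector; the top_k and build_rag_prompt functions, the tuple layout, and the prompt wording are assumptions for illustration.

```python
import math
from typing import List, Sequence, Tuple

IndexedChunk = Tuple[str, str, Sequence[float]]  # (source_url, text, embedding)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec: Sequence[float], indexed: List[IndexedChunk], k: int = 3) -> List[Tuple[str, str]]:
    """Return the k (source_url, text) pairs most similar to the query vector."""
    ranked = sorted(indexed, key=lambda item: cosine_similarity(query_vec, item[2]), reverse=True)
    return [(url, text) for url, text, _ in ranked[:k]]


def build_rag_prompt(question: str, hits: List[Tuple[str, str]]) -> str:
    """Number each retrieved chunk so the model can cite it as [n]."""
    context = "\n".join(f"[{i}] {text} (source: {url})" for i, (url, text) in enumerate(hits, 1))
    return (
        "Answer using only the context below and cite sources as [n].\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

Numbering the chunks in the context is one simple way to make citations checkable: each [n] in the answer maps back to a source URL.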

Example Questions

  • "What is the latest death toll reported?"
  • "Which roads and bridges have been damaged?"
  • "What relief supplies have been requested by the government?"
  • "Are there any tsunami warnings in effect?"
  • "What is the forecast track for the cyclone over the next 48 hours?"

Limitations

RAG chat can only answer questions based on content that has been indexed in the knowledge base. If a topic is not covered by any crawled article, the system will indicate that it does not have sufficient information rather than hallucinating an answer.
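
One way such a guard could be enforced is a similarity threshold applied before the LLM is called, as in the sketch below. This is an assumption about a possible mechanism, not a description of TerraGuard's implementation: the min_score value, the refusal message, and the generate callable are all hypothetical.

```python
from typing import Callable, List, Optional, Tuple

NO_ANSWER = "The knowledge base for this event does not contain enough information to answer."


def select_context(scored_chunks: List[Tuple[float, str]], min_score: float = 0.75) -> Optional[List[str]]:
    """Keep chunks whose similarity score meets the threshold; None if none do."""
    relevant = [text for score, text in scored_chunks if score >= min_score]
    return relevant or None


def answer(
    question: str,
    scored_chunks: List[Tuple[float, str]],
    generate: Callable[[str, List[str]], str],  # hypothetical LLM call
) -> str:
    """Decline when retrieval finds nothing relevant instead of calling the LLM."""
    context = select_context(scored_chunks)
    if context is None:
        return NO_ANSWER
    return generate(question, context)
```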

Manual Knowledge Base Management

In addition to automated discovery, users can manually add URLs to the knowledge base:

  • Paste a URL into the "Add Source" input
  • The system crawls the URL and indexes its content
  • The new content is immediately available for summaries and RAG chat

This is useful for adding internal documents, unpublished reports, or niche sources that the automated search pipeline may not discover.
