Knowledge Base & AI Summaries
How TerraGuard discovers, crawls, validates, and indexes external knowledge for each disaster event, enabling AI summaries and RAG-powered question answering.
Overview
The knowledge base is the foundation of TerraGuard's AI capabilities. For every disaster event, the platform automatically discovers relevant content from across the web — news articles, situation reports, government advisories, and humanitarian updates — then crawls, validates, and indexes them for semantic retrieval.
This indexed knowledge powers two key features: AI-generated event summaries and the RAG (Retrieval-Augmented Generation) chat interface.
Knowledge Discovery Pipeline
The discovery pipeline runs automatically whenever an event is ingested or updated. It consists of four stages:
Stage 1: Web Search
The backend's search layer runs web and news searches with terms derived from the event data — event type, location, magnitude, and date. Multiple search queries are generated to maximize coverage:
- Primary query: event-specific terms (e.g., "earthquake Papua New Guinea magnitude 6.8")
- Secondary queries: broader terms, affected region names, humanitarian response terms
- Source-specific queries: targeting known humanitarian information sources
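As a rough illustration of the query fan-out described above, the sketch below builds the three query tiers from an event record. The `Event` fields, the `build_queries` helper, and the source names passed in are assumptions made for illustration; they are not TerraGuard's actual schema or source list.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    # Hypothetical event record; the field names are assumptions.
    event_type: str
    location: str
    magnitude: Optional[float]
    date: str  # ISO date string


def build_queries(event: Event, sources: list[str]) -> list[str]:
    """Return primary, secondary, and source-specific search queries."""
    # Primary: event-specific terms.
    primary = f"{event.event_type} {event.location}"
    if event.magnitude is not None:
        primary += f" magnitude {event.magnitude}"
    queries = [primary]
    # Secondary: broader humanitarian-response phrasing.
    queries.append(
        f"{event.location} {event.event_type} humanitarian response {event.date}"
    )
    # Source-specific: one query per known source name.
    queries.extend(
        f"{source} {event.location} {event.event_type}" for source in sources
    )
    return queries
```

Keeping the primary query first means downstream ranking can favor its results if the broader queries return noise.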
Search is an internal Backend API module (app/common/search_providers.py), not a separate service. It calls Serper.dev (Google web + news) as the primary provider and falls back to Brave Search only when Serper errors. Results from either provider are normalized into a common shape, deduplicated by URL, and returned as a ranked list with titles and snippets.
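The fallback, normalization, and deduplication behavior can be sketched as follows. This is a minimal illustration, not the contents of `search_providers.py`: the provider callables are hypothetical stand-ins for the Serper and Brave clients, and the raw field names (`link`/`url`, `snippet`/`description`) are assumptions about the provider payloads.

```python
from typing import Callable

# A provider takes a query and returns raw result dicts, or raises on error.
SearchFn = Callable[[str], list[dict]]


def normalize(raw: dict, provider: str) -> dict:
    """Map a provider-specific result into a common shape."""
    return {
        "url": raw.get("link") or raw.get("url", ""),
        "title": raw.get("title", ""),
        "snippet": raw.get("snippet") or raw.get("description", ""),
        "provider": provider,
    }


def dedupe_by_url(results: list[dict]) -> list[dict]:
    """Keep the first (highest-ranked) result for each URL."""
    seen: set[str] = set()
    unique = []
    for result in results:
        key = result["url"].rstrip("/")  # Treat trailing-slash variants as equal.
        if key and key not in seen:
            seen.add(key)
            unique.append(result)
    return unique


def search_with_fallback(query: str, primary: SearchFn, fallback: SearchFn) -> list[dict]:
    """Use the primary provider; call the fallback only if the primary raises."""
    try:
        results = [normalize(r, "primary") for r in primary(query)]
    except Exception:
        results = [normalize(r, "fallback") for r in fallback(query)]
    return dedupe_by_url(results)
```

Note that an empty result list from the primary provider does not trigger the fallback here; only an error does, which matches the behavior described above.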
Stage 2: LLM Validation
Not every search result is relevant. The news filter agent — an LLM-based classifier — evaluates each result to determine whether it genuinely relates to the disaster event. This prevents noise from entering the knowledge base.
The agent considers:
- Does the article discuss the same event (matching location, date, type)?
- Is the content substantive or is it a stub/duplicate?
- Is the source credible?
Results that pass validation proceed to crawling. Typically 40-70% of search results are filtered out at this stage.
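One way to picture the validation step is a loop that asks the model three yes/no questions per result and keeps only results that pass all of them. The prompt wording, the JSON verdict format, and the `llm` callable below are assumptions for illustration; the actual agent's prompt and output schema are not documented here.

```python
import json
from typing import Callable

# Hypothetical prompt; the three checks mirror the criteria listed above.
PROMPT_TEMPLATE = """You are filtering search results for a disaster event.
Event: {event}
Result title: {title}
Result snippet: {snippet}
Answer with JSON: {{"same_event": bool, "substantive": bool, "credible": bool}}"""

CHECKS = ("same_event", "substantive", "credible")


def is_relevant(verdict: dict) -> bool:
    """A result passes only if all three checks are explicitly true."""
    return all(verdict.get(check) is True for check in CHECKS)


def filter_results(event: str, results: list[dict], llm: Callable[[str], str]) -> list[dict]:
    """Return the results the LLM judges relevant to the event."""
    kept = []
    for result in results:
        prompt = PROMPT_TEMPLATE.format(
            event=event, title=result["title"], snippet=result["snippet"]
        )
        try:
            verdict = json.loads(llm(prompt))
        except json.JSONDecodeError:
            continue  # Unparseable answer: treat the result as not relevant.
        if isinstance(verdict, dict) and is_relevant(verdict):
            kept.append(result)
    return kept
```

Dropping results on an unparseable answer errs on the side of keeping noise out of the knowledge base, at the cost of occasionally losing a relevant article.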
Stage 3: Content Crawling
Validated URLs are sent to the Web Crawler API for full content extraction. To maximize extraction success, the crawler uses a four-level fallback chain, escalating to the next strategy when the previous one fails:
- Basic — Standard HTTP fetch with content extraction
- Patient — Extended timeouts and wait-for-content strategies
- Undetected — Browser-based crawling with Xvfb for JavaScript-rendered pages
- Proxy — Proxy-routed requests for geo-restricted or bot-protected content
The crawler returns clean, structured text content stripped of navigation, ads, and boilerplate.
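The fallback chain can be sketched as a loop over strategies that stops at the first usable result. The `Strategy` callables and the `min_chars` threshold are assumptions for illustration; the Web Crawler API's real interface and success criteria are not shown in this document.

```python
from typing import Callable, Optional

# Each strategy takes a URL and returns extracted text, or raises on failure.
Strategy = Callable[[str], str]


def crawl_with_fallback(
    url: str,
    strategies: list[tuple[str, Strategy]],
    min_chars: int = 200,  # Assumed threshold for "enough content".
) -> Optional[dict]:
    """Try strategies in order; return the first result with enough text."""
    for name, fetch in strategies:
        try:
            text = fetch(url).strip()
        except Exception:
            continue  # Escalate to the next strategy in the chain.
        if len(text) >= min_chars:
            return {"url": url, "strategy": name, "text": text}
    return None  # Every strategy failed or returned too little text.
```

Ordering the list as basic, patient, undetected, proxy means the heavier browser-based and proxy-routed strategies run only when the lighter ones fail.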
Stage 4: Vector Indexing
Crawled content is split into chunks, embedded using OpenAI's embedding model, and stored in PostgreSQL with the pgvector extension. Each chunk maintains a reference to its source URL, event ID, and metadata.
This enables semantic search — instead of keyword matching, the system finds content based on meaning, which is critical for the RAG features.
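To make chunking and meaning-based retrieval concrete, here is a small pure-Python sketch. The chunk size, overlap, and row layout are assumptions, and the in-memory ranking stands in for what the database would do; it is not the platform's indexing code.

```python
import math


def chunk_text(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into windows of `size` words that overlap by `overlap` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # The last window already reaches the end of the text.
    return chunks


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], rows: list[dict], k: int = 3) -> list[dict]:
    """Rank stored rows (each with an 'embedding' key) by similarity to the query."""
    return sorted(
        rows,
        key=lambda row: cosine_similarity(query_vec, row["embedding"]),
        reverse=True,
    )[:k]
```

In PostgreSQL, pgvector's `<=>` cosine-distance operator can perform this ranking inside the query, so the application does not need to load every embedding into memory.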
Knowledge Base Interface
Screenshot: Knowledge base list view showing indexed articles for a specific event, with titles, sources, dates, content types, and relevance indicators
The knowledge base tab on the event detail page displays all indexed content for that event. Each entry shows:
| Field | Description |
|---|---|
| Title | Article or document headline |
| Source | Publication or organization name |
| URL | Link to the original content |
| Type | News article, SITREP, advisory, report |
| Indexed Date | When the content was crawled and indexed |
| Chunks | Number of vector chunks created |
You can search within the knowledge base using the search bar, which performs semantic search over the indexed content.
AI Summary Generation
AI summaries synthesize all available knowledge about an event into a concise, structured overview.
Screenshot: AI summary generation interface showing a "Generate Summary" button, a loading state with progress indicator, and the resulting summary with sections for situation overview, impact, and response
How Summaries Work
When you request a summary, the system:
- Retrieves the most relevant knowledge base chunks using semantic search
- Includes event measurements, affected countries, and timeline data
- Sends the assembled context to the AI model with structured prompting
- Returns a multi-section summary covering:
  - Situation Overview — What happened, where, and when
  - Impact Assessment — Affected population, infrastructure damage, casualties
  - Response Status — Humanitarian actions, government declarations, aid mobilization
  - Key Developments — Recent changes and emerging concerns
  - Outlook — Expected trajectory and anticipated needs
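The context assembly in the steps above might look roughly like this. The prompt wording, the `max_chars` budget, and the dict shapes for `event` and `chunks` are assumptions for illustration; only the five section names come from this page.

```python
SECTIONS = [
    "Situation Overview",
    "Impact Assessment",
    "Response Status",
    "Key Developments",
    "Outlook",
]


def build_summary_prompt(event: dict, chunks: list[dict], max_chars: int = 6000) -> str:
    """Assemble event facts and retrieved chunks into a structured prompt."""
    facts = "\n".join(f"- {key}: {value}" for key, value in event.items())
    context_parts, used = [], 0
    for chunk in chunks:  # Assumed to be sorted by relevance already.
        entry = f"[{chunk['source_url']}]\n{chunk['text']}"
        if used + len(entry) > max_chars:
            break  # Stop once the context budget is exhausted.
        context_parts.append(entry)
        used += len(entry)
    headings = "\n".join(f"## {name}" for name in SECTIONS)
    return (
        "Summarize the disaster event using only the facts and sources below.\n\n"
        f"Event facts:\n{facts}\n\n"
        "Sources:\n" + "\n\n".join(context_parts) + "\n\n"
        f"Write one section under each heading:\n{headings}\n"
    )
```

Because chunks are added in relevance order, the budget cut drops the least relevant material first.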
Regenerating Summaries
Summaries can be regenerated at any time. This is useful when:
- New articles have been indexed since the last summary
- The event situation has evolved significantly
- You want to incorporate a specific angle or focus
Each regeneration uses the latest available knowledge base content.
RAG Chat
The RAG chat interface allows you to ask free-form questions about any event. Unlike a general-purpose chatbot, answers are grounded in the indexed knowledge base for that specific event.
Screenshot: RAG chat interface showing a conversation thread with user questions and AI responses, with source citations linking to original articles
How RAG Chat Works
- Your question is embedded using the same model used for knowledge base indexing
- Semantic search finds the most relevant chunks from the event's knowledge base
- These chunks are assembled into a context window with the question
- The LLM generates an answer grounded in the retrieved context
- Source citations are included so you can verify claims against original documents
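The flow above can be sketched end to end with injected dependencies. The `embed`, `search`, and `llm` callables are hypothetical stand-ins for the embedding model, the vector search, and the chat model, and the numbered-citation convention is an assumption about how sources could be tied to the answer.

```python
from typing import Callable


def answer_question(
    question: str,
    embed: Callable[[str], list[float]],
    search: Callable[[list[float], int], list[dict]],
    llm: Callable[[str], str],
    k: int = 4,
) -> dict:
    """Embed the question, retrieve chunks, and ask the LLM with numbered sources."""
    chunks = search(embed(question), k)
    # Number the sources so the model can cite them as [1], [2], ...
    context = "\n\n".join(
        f"[{i}] {chunk['text']}" for i, chunk in enumerate(chunks, start=1)
    )
    prompt = (
        "Answer using only the numbered sources. Cite them like [1].\n\n"
        f"{context}\n\nQuestion: {question}"
    )
    citations = [
        {"index": i, "url": chunk["source_url"]}
        for i, chunk in enumerate(chunks, start=1)
    ]
    return {"answer": llm(prompt), "citations": citations}
```

Returning the citation list alongside the answer lets the interface turn each `[n]` marker into a link to the original article.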
Example Questions
- "What is the latest death toll reported?"
- "Which roads and bridges have been damaged?"
- "What relief supplies have been requested by the government?"
- "Are there any tsunami warnings in effect?"
- "What is the forecast track for the cyclone over the next 48 hours?"
Limitations
RAG chat can only answer questions based on content that has been indexed in the knowledge base. If a topic is not covered by any crawled article, the system will indicate that it does not have sufficient information rather than hallucinating an answer.
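This page does not describe how the system decides that it lacks sufficient information. One common approach, shown here purely as an assumption, is to require a minimum similarity score before any chunk is passed to the model and to return a fixed message otherwise. The threshold value and function names are hypothetical.

```python
from typing import Callable, Optional

NO_ANSWER = (
    "The knowledge base does not contain enough information to answer this question."
)


def select_context(
    scored_chunks: list[tuple[float, str]],
    min_score: float = 0.75,  # Assumed similarity threshold.
    min_chunks: int = 1,
) -> Optional[list[str]]:
    """Return chunk texts at or above the threshold, or None if too few qualify."""
    relevant = [text for score, text in scored_chunks if score >= min_score]
    return relevant if len(relevant) >= min_chunks else None


def respond(
    scored_chunks: list[tuple[float, str]],
    generate: Callable[[list[str]], str],
) -> str:
    """Generate an answer only when enough relevant context was retrieved."""
    context = select_context(scored_chunks)
    if context is None:
        return NO_ANSWER  # Decline instead of producing an ungrounded answer.
    return generate(context)
```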
Manual Knowledge Base Management
In addition to automated discovery, users can manually add URLs to the knowledge base:
- Paste a URL into the "Add Source" input
- The system crawls the URL and indexes its content
- Once indexing completes, the new content is available for summaries and RAG chat
This is useful for adding internal documents, unpublished reports, or niche sources that the automated search pipeline may not discover.
Next Steps
- Event Detail View — Full event information and context
- Reports — Generate structured reports using knowledge base content
- Knowledge Discovery (Architecture) — Technical deep dive into the pipeline