Crawler API Reference
API reference for the TerraGuard web crawler service, covering async and sync crawling, job polling, strategy levels, and content extraction.
Overview
The web crawler extracts content from URLs using a browser-based crawling engine. It supports async (job-based) and sync (blocking) crawling modes, with a 4-level strategy fallback chain that escalates from basic fetching to proxy-based extraction for difficult sites.
Base URL: /v1
Authentication: X-API-Key header required in production. No auth in local development.
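As a minimal sketch of the authentication rule above, a client can attach the X-API-Key header only when a key is configured, so the same code works in production and in local development. The function name `build_headers` and the environment variable `TERRAGUARD_API_KEY` are illustrative assumptions, not part of the service.

```python
import os


def build_headers(api_key=None):
    """Return request headers; X-API-Key is added only when a key is available."""
    headers = {"Content-Type": "application/json"}
    # TERRAGUARD_API_KEY is a hypothetical variable name chosen for this sketch.
    key = api_key or os.environ.get("TERRAGUARD_API_KEY")
    if key:
        headers["X-API-Key"] = key
    return headers
```

In local development, leaving the key unset yields headers without X-API-Key, which matches the "no auth" behaviour described above.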
Endpoints
Submit Crawl Job (Async)
POST /v1/crawl
Submits a URL for crawling and returns a job ID for polling. The crawl runs asynchronously in the background.
Request Body
| Field | Type | Required | Description |
|---|---|---|---|
| url | string | Yes | URL to crawl |
| wait_for_selector | string | No | CSS selector to wait for before extracting (for JS-rendered pages) |
| js_code | string | No | JavaScript to execute on the page before extraction |
| max_strategy_level | integer | No | Maximum strategy level to attempt (1-4, default: 4) |
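The request fields in the table above could be assembled and checked client-side before submission. The sketch below uses only the Python standard library. The helper names `build_crawl_body` and `submit_crawl`, the 30-second client timeout, and the choice to validate locally are assumptions made for illustration; the service performs its own validation.

```python
import json
import urllib.request


def build_crawl_body(url, wait_for_selector=None, js_code=None, max_strategy_level=None):
    """Assemble the JSON body for POST /v1/crawl, omitting unset optional fields."""
    if not isinstance(url, str) or not url.strip():
        raise ValueError("url is required and must be a non-empty string")
    body = {"url": url}
    if wait_for_selector is not None:
        body["wait_for_selector"] = wait_for_selector
    if js_code is not None:
        body["js_code"] = js_code
    if max_strategy_level is not None:
        is_int = isinstance(max_strategy_level, int) and not isinstance(max_strategy_level, bool)
        if not is_int or not 1 <= max_strategy_level <= 4:
            raise ValueError("max_strategy_level must be an integer from 1 to 4")
        body["max_strategy_level"] = max_strategy_level
    return body


def submit_crawl(base_url, body, headers=None):
    """POST the body to /v1/crawl and return the parsed job record."""
    request = urllib.request.Request(
        f"{base_url}/v1/crawl",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", **(headers or {})},
        method="POST",
    )
    # The 30-second timeout is an arbitrary client-side choice for this sketch.
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

A call such as `submit_crawl("http://localhost:5603", build_crawl_body("https://example.com/article"))` would return the job record, whose `job_id` is then used for polling.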
Example
curl -X POST http://localhost:5603/v1/crawl \
-H "Content-Type: application/json" \
-d '{
"url": "https://reliefweb.int/report/myanmar/earthquake-sitrep",
"max_strategy_level": 3
}'
Response
{
"job_id": "crwl_a1b2c3d4",
"status": "pending",
"url": "https://reliefweb.int/report/myanmar/earthquake-sitrep",
"created_at": "2025-03-28T10:30:00Z"
}
Poll Crawl Job
GET /v1/crawl/{job_id}
Check the status of a crawl job and retrieve results when complete.
| Parameter | Type | Description |
|---|---|---|
| job_id | string | Job ID returned by the crawl submission |
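A polling loop over this endpoint might look like the sketch below. It treats "completed" and "failed" as terminal statuses and anything else ("pending", "processing") as still running, based on the statuses that appear in this reference's example responses. The names `fetch_job` and `poll_job`, the 2-second interval, and the 120-second ceiling are assumptions chosen for illustration.

```python
import json
import time
import urllib.request

TERMINAL_STATUSES = {"completed", "failed"}


def fetch_job(base_url, job_id, headers=None):
    """GET /v1/crawl/{job_id} and return the parsed JSON job record."""
    request = urllib.request.Request(f"{base_url}/v1/crawl/{job_id}", headers=headers or {})
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def poll_job(job_id, fetch, interval=2.0, max_wait=120.0, sleep=time.sleep):
    """Call fetch(job_id) until the job reaches a terminal status or max_wait elapses."""
    deadline = time.monotonic() + max_wait
    while True:
        job = fetch(job_id)
        if job["status"] in TERMINAL_STATUSES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {job['status']} after {max_wait}s")
        sleep(interval)
```

Passing the fetch function as an argument keeps the loop testable without a running service, for example `poll_job(job_id, lambda jid: fetch_job("http://localhost:5603", jid))`.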
Pending Response
{
"job_id": "crwl_a1b2c3d4",
"status": "processing",
"url": "https://reliefweb.int/report/myanmar/earthquake-sitrep",
"strategy_level": 2,
"created_at": "2025-03-28T10:30:00Z"
}
Completed Response
{
"job_id": "crwl_a1b2c3d4",
"status": "completed",
"url": "https://reliefweb.int/report/myanmar/earthquake-sitrep",
"strategy_level": 1,
"result": {
"markdown": "# Myanmar Earthquake Situation Report\n\nA 7.7 magnitude earthquake...",
"fit_markdown": "Myanmar Earthquake Situation Report. A 7.7 magnitude earthquake struck...",
"html": "<h1>Myanmar Earthquake Situation Report</h1><p>A 7.7 magnitude...</p>",
"cleaned_html": "<h1>Myanmar Earthquake Situation Report</h1><p>A 7.7 magnitude...</p>",
"metadata": {
"title": "Myanmar Earthquake Situation Report #3",
"description": "Situation update on the Myanmar earthquake response",
"language": "en",
"word_count": 2450
},
"links": [
{"url": "https://reliefweb.int/map/myanmar", "text": "View Map"},
{"url": "https://unocha.org/myanmar", "text": "OCHA Myanmar"}
],
"media": [
{"url": "https://reliefweb.int/images/map.png", "type": "image", "alt": "Affected area map"}
]
},
"duration_ms": 3240,
"created_at": "2025-03-28T10:30:00Z",
"completed_at": "2025-03-28T10:30:03Z"
}
Failed Response
{
"job_id": "crwl_a1b2c3d4",
"status": "failed",
"url": "https://example.com/blocked-page",
"strategy_level": 4,
"error": "All strategy levels exhausted. Last error: connection timeout",
"duration_ms": 45200,
"created_at": "2025-03-28T10:30:00Z",
"completed_at": "2025-03-28T10:30:45Z"
}
Sync Crawl (Blocking)
POST /v1/crawl/sync
Performs a crawl synchronously and returns the result in the response. The request blocks until crawling is complete or times out.
The request body is identical to POST /v1/crawl. The response is the same as a completed poll response, but returned directly without polling.
curl -X POST http://localhost:5603/v1/crawl/sync \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com/article"}'
Use this for simple pages that crawl quickly. For pages that may take longer (JS-heavy sites, anti-bot protection), prefer the async endpoint to avoid HTTP timeouts.
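One way to act on this advice is to try the sync endpoint first and treat a 408 as the signal to fall back to the async endpoint. The sketch below assumes the helper names `build_sync_request` and `crawl_sync` and a 60-second client timeout, none of which come from the service itself.

```python
import json
import urllib.error
import urllib.request


def build_sync_request(base_url, payload, headers=None):
    """Build the POST /v1/crawl/sync request without sending it."""
    return urllib.request.Request(
        f"{base_url}/v1/crawl/sync",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", **(headers or {})},
        method="POST",
    )


def crawl_sync(base_url, payload, headers=None, timeout=60.0):
    """Return the completed job record, or None on HTTP 408 so the caller can go async."""
    request = build_sync_request(base_url, payload, headers)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return json.load(response)
    except urllib.error.HTTPError as exc:
        if exc.code == 408:  # sync crawl timed out on the server side
            return None
        raise
```

A caller that receives `None` can resubmit the same body to POST /v1/crawl and poll for the result instead.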
Health Check
GET /v1/health
{
"status": "healthy",
"worker": "connected",
"active_jobs": 3,
"max_concurrency": 10
}
Strategy Levels
The crawler uses a 4-level fallback chain. Each level is tried in order until content is successfully extracted or all levels are exhausted.
| Level | Name | Description | Typical Duration |
|---|---|---|---|
| 1 | Basic | Standard headless browser fetch | 1-3 seconds |
| 2 | Patient | Extended wait times, JavaScript rendering, retries | 5-10 seconds |
| 3 | Undetected | Xvfb virtual display, stealth browser flags | 10-20 seconds |
| 4 | Proxy | Routes through residential proxy network | 15-30 seconds |
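Use the max_strategy_level parameter to cap how far the fallback chain goes. For known simple pages, setting max_strategy_level: 1 prevents escalation to the slower levels, so a crawl that cannot succeed at level 1 fails within seconds instead of working through the whole chain.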
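To get a feel for what the cap means for client timeouts, the sketch below sums the upper ends of the "Typical Duration" ranges for every level up to the cap. This is a rough estimate under the assumption that each level runs to its typical upper bound before the next one starts; actual timings are not specified here and may differ. The names `TYPICAL_DURATION_S` and `estimated_worst_case_seconds` are illustrative.

```python
# (lower, upper) typical duration in seconds per strategy level, from the table above.
TYPICAL_DURATION_S = {1: (1, 3), 2: (5, 10), 3: (10, 20), 4: (15, 30)}


def estimated_worst_case_seconds(max_strategy_level=4):
    """Sum the typical upper bounds of all levels up to and including the cap."""
    if max_strategy_level not in TYPICAL_DURATION_S:
        raise ValueError("max_strategy_level must be between 1 and 4")
    return sum(
        upper
        for level, (_, upper) in TYPICAL_DURATION_S.items()
        if level <= max_strategy_level
    )
```

Under this assumption a cap of 1 gives about 3 seconds and the default cap of 4 gives about 63 seconds, which suggests why long chains are better suited to the async endpoint.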
Response Content Fields
| Field | Description |
|---|---|
| markdown | Full page content converted to Markdown |
| fit_markdown | Cleaned Markdown with boilerplate removed (navigation, footers, ads) |
| html | Original HTML of the main content area |
| cleaned_html | HTML with scripts, styles, and non-content elements stripped |
| metadata | Page metadata (title, description, language, word count) |
| links | Array of links found in the content |
| media | Array of images and media found in the content |
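A consumer of these fields might prefer the cleanest text available and fall back to rawer forms when a field is empty. The preference order below (fit_markdown, then markdown, cleaned_html, html) is an assumption made for this sketch rather than a documented recommendation, and `best_text` and `summarize` are hypothetical helper names.

```python
def best_text(result):
    """Return (field_name, text) for the first non-empty content field, cleanest first."""
    for field in ("fit_markdown", "markdown", "cleaned_html", "html"):
        value = result.get(field)
        if value:
            return field, value
    return None, ""


def summarize(job):
    """Flatten a job record into the pieces a downstream pipeline typically needs."""
    result = job.get("result") or {}
    metadata = result.get("metadata") or {}
    field, text = best_text(result)
    return {
        "title": metadata.get("title"),
        "word_count": metadata.get("word_count"),
        "source_field": field,
        "text": text,
        "link_urls": [link["url"] for link in result.get("links", [])],
        "image_urls": [m["url"] for m in result.get("media", []) if m.get("type") == "image"],
    }
```

Because failed jobs carry no `result` object, `summarize` returns empty text and empty lists for them instead of raising.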
Error Responses
{
"error": "invalid_url",
"message": "The provided URL is not valid",
"status": 400
}
| Status Code | Description |
|---|---|
| 400 | Invalid request (malformed URL, invalid parameters) |
| 401 | Missing or invalid API key (production only) |
| 404 | Job ID not found |
| 408 | Sync crawl timed out |
| 500 | Internal crawling error |
| 503 | Worker unavailable or at max concurrency |
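Client code can map the status codes above to a next step. The sketch below parses the error body shown earlier and attaches a suggested action. The action labels ("fix_request", "retry_later", and so on) and the function name `describe_error` are invented for illustration, and the decision to retry only on 503 is an assumption rather than documented guidance.

```python
import json

# Hypothetical action labels keyed by the documented status codes.
ACTIONS = {
    400: "fix_request",
    401: "check_api_key",
    404: "unknown_job",
    408: "use_async",
    500: "report",
    503: "retry_later",
}


def describe_error(status_code, body_text):
    """Combine the HTTP status with the JSON error body into one dict."""
    try:
        body = json.loads(body_text)
    except (ValueError, TypeError):
        body = {}  # body was empty or not JSON
    if not isinstance(body, dict):
        body = {}
    return {
        "status": status_code,
        "error": body.get("error"),
        "message": body.get("message"),
        "action": ACTIONS.get(status_code, "unknown"),
    }
```

For example, a 408 from the sync endpoint maps to "use_async", which corresponds to resubmitting through POST /v1/crawl and polling.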
Search API Reference
The standalone Search API service has been retired. Web and news search now run inside the Backend API as an internal module using Serper.dev (primary) and Brave Search (fallback).
GeoPop API Reference
API reference for the TerraGuard GeoPop service, providing reverse geocoding, population analysis, exposure assessment, and country lookups.