TerraGuard

Crawler API Reference

API reference for the TerraGuard web crawler service, covering async and sync crawling, job polling, strategy levels, and content extraction.

Overview

The web crawler extracts content from URLs using a browser-based crawling engine. It supports async (job-based) and sync (blocking) crawling modes, with a 4-level strategy fallback chain that escalates from basic fetching to proxy-based extraction for difficult sites.

Base URL: /v1

Authentication: X-API-Key header required in production. No auth in local development.

Endpoints

Submit Crawl Job (Async)

POST /v1/crawl

Submits a URL for crawling and returns a job ID for polling. The crawl runs asynchronously in the background.

Request Body

FieldTypeRequiredDescription
urlstringYesURL to crawl
wait_for_selectorstringNoCSS selector to wait for before extracting (for JS-rendered pages)
js_codestringNoJavaScript to execute on the page before extraction
max_strategy_levelintegerNoMaximum strategy level to attempt (1-4, default: 4)

Example

curl -X POST http://localhost:5603/v1/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://reliefweb.int/report/myanmar/earthquake-sitrep",
    "max_strategy_level": 3
  }'

Response

{
  "job_id": "crwl_a1b2c3d4",
  "status": "pending",
  "url": "https://reliefweb.int/report/myanmar/earthquake-sitrep",
  "created_at": "2025-03-28T10:30:00Z"
}

Poll Crawl Job

GET /v1/crawl/{job_id}

Check the status of a crawl job and retrieve results when complete.

ParameterTypeDescription
job_idstringJob ID returned by the crawl submission

Pending Response

{
  "job_id": "crwl_a1b2c3d4",
  "status": "processing",
  "url": "https://reliefweb.int/report/myanmar/earthquake-sitrep",
  "strategy_level": 2,
  "created_at": "2025-03-28T10:30:00Z"
}

Completed Response

{
  "job_id": "crwl_a1b2c3d4",
  "status": "completed",
  "url": "https://reliefweb.int/report/myanmar/earthquake-sitrep",
  "strategy_level": 1,
  "result": {
    "markdown": "# Myanmar Earthquake Situation Report\n\nA 7.7 magnitude earthquake...",
    "fit_markdown": "Myanmar Earthquake Situation Report. A 7.7 magnitude earthquake struck...",
    "html": "<h1>Myanmar Earthquake Situation Report</h1><p>A 7.7 magnitude...</p>",
    "cleaned_html": "<h1>Myanmar Earthquake Situation Report</h1><p>A 7.7 magnitude...</p>",
    "metadata": {
      "title": "Myanmar Earthquake Situation Report #3",
      "description": "Situation update on the Myanmar earthquake response",
      "language": "en",
      "word_count": 2450
    },
    "links": [
      {"url": "https://reliefweb.int/map/myanmar", "text": "View Map"},
      {"url": "https://unocha.org/myanmar", "text": "OCHA Myanmar"}
    ],
    "media": [
      {"url": "https://reliefweb.int/images/map.png", "type": "image", "alt": "Affected area map"}
    ]
  },
  "duration_ms": 3240,
  "created_at": "2025-03-28T10:30:00Z",
  "completed_at": "2025-03-28T10:30:03Z"
}

Failed Response

{
  "job_id": "crwl_a1b2c3d4",
  "status": "failed",
  "url": "https://example.com/blocked-page",
  "strategy_level": 4,
  "error": "All strategy levels exhausted. Last error: connection timeout",
  "duration_ms": 45200,
  "created_at": "2025-03-28T10:30:00Z",
  "completed_at": "2025-03-28T10:30:45Z"
}

Sync Crawl (Blocking)

POST /v1/crawl/sync

Performs a crawl synchronously and returns the result in the response. The request blocks until crawling is complete or times out.

The request body is identical to POST /v1/crawl. The response is the same as a completed poll response, but returned directly without polling.

curl -X POST http://localhost:5603/v1/crawl/sync \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/article"}'

Use this for simple pages that crawl quickly. For pages that may take longer (JS-heavy sites, anti-bot protection), prefer the async endpoint to avoid HTTP timeouts.

Health Check

GET /v1/health
{
  "status": "healthy",
  "worker": "connected",
  "active_jobs": 3,
  "max_concurrency": 10
}

Strategy Levels

The crawler uses a 4-level fallback chain. Each level is tried in order until content is successfully extracted or all levels are exhausted.

Loading diagram...
LevelNameDescriptionTypical Duration
1BasicStandard headless browser fetch1-3 seconds
2PatientExtended wait times, JavaScript rendering, retries5-10 seconds
3UndetectedXvfb virtual display, stealth browser flags10-20 seconds
4ProxyRoutes through residential proxy network15-30 seconds

Use the max_strategy_level parameter to cap how far the fallback chain goes. For known simple pages, setting max_strategy_level: 1 speeds up crawling significantly.

Response Content Fields

FieldDescription
markdownFull page content converted to Markdown
fit_markdownCleaned Markdown with boilerplate removed (navigation, footers, ads)
htmlOriginal HTML of the main content area
cleaned_htmlHTML with scripts, styles, and non-content elements stripped
metadataPage metadata (title, description, language, word count)
linksArray of links found in the content
mediaArray of images and media found in the content

Error Responses

{
  "error": "invalid_url",
  "message": "The provided URL is not valid",
  "status": 400
}
Status CodeDescription
400Invalid request (malformed URL, invalid parameters)
401Missing or invalid API key (production only)
404Job ID not found
408Sync crawl timed out
500Internal crawling error
503Worker unavailable or at max concurrency

On this page