Crawler API Reference

API reference for the TerraGuard web crawler service, covering async and sync crawling, job polling, strategy levels, and content extraction.

Overview

The web crawler extracts content from URLs using a browser-based crawling engine. It supports async (job-based) and sync (blocking) crawling modes, with a 4-level strategy fallback chain that escalates from basic fetching to proxy-based extraction for difficult sites.

Base URL: /v1

Authentication: X-API-Key header required in production. No auth in local development.

Endpoints

Submit Crawl Job (Async)

POST /v1/crawl

Submits a URL for crawling and returns a job ID for polling. The crawl runs asynchronously in the background.

Request Body

Field	Type	Required	Description
`url`	string	Yes	URL to crawl
`wait_for_selector`	string	No	CSS selector to wait for before extracting (for JS-rendered pages)
`js_code`	string	No	JavaScript to execute on the page before extraction
`max_strategy_level`	integer	No	Maximum strategy level to attempt (1-4, default: 4)

Example

curl -X POST http://localhost:5603/v1/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://reliefweb.int/report/myanmar/earthquake-sitrep",
    "max_strategy_level": 3
  }'

Response

{
  "job_id": "crwl_a1b2c3d4",
  "status": "pending",
  "url": "https://reliefweb.int/report/myanmar/earthquake-sitrep",
  "created_at": "2025-03-28T10:30:00Z"
}

Poll Crawl Job

GET /v1/crawl/{job_id}

Check the status of a crawl job and retrieve results when complete.

Parameter	Type	Description
`job_id`	string	Job ID returned by the crawl submission

Pending Response

{
  "job_id": "crwl_a1b2c3d4",
  "status": "processing",
  "url": "https://reliefweb.int/report/myanmar/earthquake-sitrep",
  "strategy_level": 2,
  "created_at": "2025-03-28T10:30:00Z"
}

Completed Response

{
  "job_id": "crwl_a1b2c3d4",
  "status": "completed",
  "url": "https://reliefweb.int/report/myanmar/earthquake-sitrep",
  "strategy_level": 1,
  "result": {
    "markdown": "# Myanmar Earthquake Situation Report\n\nA 7.7 magnitude earthquake...",
    "fit_markdown": "Myanmar Earthquake Situation Report. A 7.7 magnitude earthquake struck...",
    "html": "<h1>Myanmar Earthquake Situation Report</h1><p>A 7.7 magnitude...</p>",
    "cleaned_html": "<h1>Myanmar Earthquake Situation Report</h1><p>A 7.7 magnitude...</p>",
    "metadata": {
      "title": "Myanmar Earthquake Situation Report #3",
      "description": "Situation update on the Myanmar earthquake response",
      "language": "en",
      "word_count": 2450
    },
    "links": [
      {"url": "https://reliefweb.int/map/myanmar", "text": "View Map"},
      {"url": "https://unocha.org/myanmar", "text": "OCHA Myanmar"}
    ],
    "media": [
      {"url": "https://reliefweb.int/images/map.png", "type": "image", "alt": "Affected area map"}
    ]
  },
  "duration_ms": 3240,
  "created_at": "2025-03-28T10:30:00Z",
  "completed_at": "2025-03-28T10:30:03Z"
}

Failed Response

{
  "job_id": "crwl_a1b2c3d4",
  "status": "failed",
  "url": "https://example.com/blocked-page",
  "strategy_level": 4,
  "error": "All strategy levels exhausted. Last error: connection timeout",
  "duration_ms": 45200,
  "created_at": "2025-03-28T10:30:00Z",
  "completed_at": "2025-03-28T10:30:45Z"
}

Sync Crawl (Blocking)

POST /v1/crawl/sync

Performs a crawl synchronously and returns the result in the response. The request blocks until crawling is complete or times out.

The request body is identical to POST /v1/crawl. The response is the same as a completed poll response, but returned directly without polling.

curl -X POST http://localhost:5603/v1/crawl/sync \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/article"}'

Use this for simple pages that crawl quickly. For pages that may take longer (JS-heavy sites, anti-bot protection), prefer the async endpoint to avoid HTTP timeouts.

Health Check

GET /v1/health

{
  "status": "healthy",
  "worker": "connected",
  "active_jobs": 3,
  "max_concurrency": 10
}

Strategy Levels

The crawler uses a 4-level fallback chain. Each level is tried in order until content is successfully extracted or all levels are exhausted.

Loading diagram...

Level	Name	Description	Typical Duration
1	Basic	Standard headless browser fetch	1-3 seconds
2	Patient	Extended wait times, JavaScript rendering, retries	5-10 seconds
3	Undetected	Xvfb virtual display, stealth browser flags	10-20 seconds
4	Proxy	Routes through residential proxy network	15-30 seconds

Use the max_strategy_level parameter to cap how far the fallback chain goes. For known simple pages, setting max_strategy_level: 1 speeds up crawling significantly.

Response Content Fields

Field	Description
`markdown`	Full page content converted to Markdown
`fit_markdown`	Cleaned Markdown with boilerplate removed (navigation, footers, ads)
`html`	Original HTML of the main content area
`cleaned_html`	HTML with scripts, styles, and non-content elements stripped
`metadata`	Page metadata (title, description, language, word count)
`links`	Array of links found in the content
`media`	Array of images and media found in the content

Error Responses

{
  "error": "invalid_url",
  "message": "The provided URL is not valid",
  "status": 400
}

Status Code	Description
400	Invalid request (malformed URL, invalid parameters)
401	Missing or invalid API key (production only)
404	Job ID not found
408	Sync crawl timed out
500	Internal crawling error
503	Worker unavailable or at max concurrency

On this page