TerraGuard

Web Crawler

A Go API server paired with a Python crawl4ai worker. It uses a 4-level strategy fallback chain to extract web content reliably from news sites and official reports.

Overview

The tg-web-crawler-api is a two-process service that extracts web page content for the TerraGuard knowledge base. A Go API server (port 4003) accepts crawl requests and manages an async job queue, while a Python worker (port 4004) performs the actual browser-based crawling using the crawl4ai SDK.

When a URL is submitted, the crawler attempts extraction through a 4-level strategy chain, escalating from fast headless browsing to anti-detection techniques with residential proxies. The escalation lets the crawler extract content even from sites with aggressive bot protection.

Architecture


Key Features

4-Level Strategy Chain

Each crawl attempt starts at Level 1 and escalates on failure. The Python worker implements the actual crawling logic for each level:

| Level | Strategy | Timeout | Description |
|-------|----------|---------|-------------|
| 1 | Basic | 30s | Fast headless Chrome with stealth settings. Works for most standard news sites. |
| 2 | Patient | 60s | Waits for network idle and scans the full page. Handles JavaScript-heavy SPAs and lazy-loaded content. Adds a 3-second delay. |
| 3 | Undetected | 90s | Runs Chrome with headless=False behind Xvfb (virtual display). Uses the UndetectedAdapter with magic mode to bypass bot detection. |
| 4 | Proxy | 90s | Level 3 configuration plus a residential proxy. Last resort for sites that block data center IPs. |
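The escalation described above can be sketched as a simple loop. This is a minimal illustration, not the service's actual code: the level numbers and timeouts come from the table, while the lowercase strategy names (other than "basic"), the `attempt` callable, and the `start_level` parameter are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Level:
    number: int
    name: str
    timeout_s: int


# Levels and timeouts as documented; names other than "basic" are assumed.
LEVELS = [
    Level(1, "basic", 30),
    Level(2, "patient", 60),
    Level(3, "undetected", 90),
    Level(4, "proxy", 90),
]


def crawl_with_fallback(
    url: str,
    attempt: Callable[[str, Level], Optional[str]],
    start_level: int = 1,
    fallback_enabled: bool = True,
) -> tuple[str, str]:
    """Try each level in order; return (strategy_name, content) on first success.

    `attempt` is a hypothetical callable that returns the extracted text,
    or returns None / raises when the level fails.
    """
    errors = []
    for level in LEVELS:
        if level.number < start_level:
            continue
        content = None
        try:
            content = attempt(url, level)
            if content is None:
                errors.append(f"{level.name}: no content")
        except Exception as exc:
            errors.append(f"{level.name}: {exc}")
        if content is not None:
            return level.name, content
        if not fallback_enabled:
            break  # single attempt only when fallback is disabled
    raise RuntimeError(f"all strategies failed for {url}: " + "; ".join(errors))
```

With fallback disabled, the loop stops after the first attempted level, which is one plausible reading of the FALLBACK_ENABLED setting.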

Async Job Queue

The Go server maintains a bounded-concurrency job queue:

  • MAX_CONCURRENT_CRAWLS controls how many crawls run in parallel (default: 3)
  • MAX_QUEUE_SIZE limits the pending queue to prevent memory exhaustion (default: 100)
  • Each job has a unique ID for status polling
  • Results are stored in an in-memory thread-safe map
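To illustrate the queueing behaviour listed above, here is a rough Python sketch built on asyncio. The real queue lives in the Go server and its code is not shown here, so the class name, the "pending" and "running" statuses, and the `crawl` callable are all assumptions. A plain dict stands in for the thread-safe map because asyncio runs on a single thread.

```python
import asyncio
import uuid


class QueueFull(Exception):
    """Raised when the pending queue already holds max_queue_size jobs."""


class JobQueue:
    def __init__(self, crawl, max_concurrent=3, max_queue_size=100):
        self._crawl = crawl  # hypothetical async callable: url -> result
        self._sem = asyncio.Semaphore(max_concurrent)
        self._max_queue_size = max_queue_size
        self._pending = 0    # jobs accepted but not yet holding a slot
        self._tasks = set()
        self.jobs = {}       # job_id -> {"url", "status", "result"}

    def submit(self, url):
        """Register a job and return its ID without waiting for the crawl."""
        if self._pending >= self._max_queue_size:
            raise QueueFull("queue is full")
        job_id = uuid.uuid4().hex
        self.jobs[job_id] = {"url": url, "status": "pending", "result": None}
        self._pending += 1
        task = asyncio.create_task(self._run(job_id, url))
        self._tasks.add(task)
        task.add_done_callback(self._tasks.discard)
        return job_id

    async def _run(self, job_id, url):
        async with self._sem:  # at most max_concurrent crawls hold a slot
            self._pending -= 1
            job = self.jobs[job_id]
            job["status"] = "running"
            try:
                job["result"] = await self._crawl(url)
                job["status"] = "completed"
            except Exception as exc:
                job["result"] = str(exc)
                job["status"] = "failed"

    async def drain(self):
        """Wait for every submitted job to finish."""
        if self._tasks:
            await asyncio.gather(*list(self._tasks))
```

The queue should be created inside a running event loop (for example within a coroutine passed to `asyncio.run`), since `submit` schedules tasks on that loop.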

Sync and Async Modes

The crawler supports two request modes:

  • Async (default) -- POST /v1/crawl returns immediately with a job_id. The client polls GET /v1/crawl/{job_id} until the status changes to completed or failed.
  • Sync -- With "sync": true in the request body, the request blocks until the crawl completes or times out and returns the result directly.
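A client using async mode needs a polling loop. The sketch below shows one way to write it. The `get_job` callable is hypothetical and stands for whatever performs GET /v1/crawl/{job_id} and decodes the JSON body; the interval and timeout defaults are arbitrary choices for the example.

```python
import time
from typing import Callable


def wait_for_job(
    get_job: Callable[[str], dict],
    job_id: str,
    interval_s: float = 1.0,
    timeout_s: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll a job until its status is 'completed' or 'failed'.

    Raises TimeoutError if the job has not reached a terminal status
    within timeout_s seconds.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        job = get_job(job_id)
        if job.get("status") in ("completed", "failed"):
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"job {job_id} still {job.get('status')!r} after {timeout_s}s"
            )
        sleep(interval_s)
```

Passing `get_job` and `sleep` as parameters keeps the loop independent of any HTTP library and makes it easy to exercise without a running server.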

Content Validation

The crawler validates extracted content before accepting it:

  • Minimum word count (MIN_WORD_COUNT, default 10) -- Rejects pages that yield too little text, which usually indicates a bot-blocked page or paywall.
  • Content type validation ensures only text/HTML content is processed.
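The two checks above might look roughly like the following. This is a sketch under assumptions: the source does not say how words are counted or which MIME types are accepted, so whitespace splitting and the accepted-type list here are illustrative choices only.

```python
MIN_WORD_COUNT = 10  # documented default


def count_words(text: str) -> int:
    """Count whitespace-separated tokens."""
    return len(text.split())


def validate_content(
    text: str, content_type: str, min_words: int = MIN_WORD_COUNT
) -> tuple[bool, str]:
    """Return (ok, reason). The accepted MIME types are an assumption."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    mime = content_type.split(";", 1)[0].strip().lower()
    if mime not in ("text/html", "text/plain", "application/xhtml+xml"):
        return False, f"unsupported content type: {mime or 'unknown'}"
    words = count_words(text)
    if words < min_words:
        return False, f"too little text: {words} words, need {min_words}"
    return True, "ok"
```

A short block page such as "Access denied" fails the word-count check, which is the kind of rejection that would push a crawl to the next strategy level.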

Rich Output

Successful crawls return:

  • Markdown -- Clean markdown representation of the page content
  • HTML -- Raw HTML of the main content area
  • Metadata -- Page title, description, author, publish date
  • Links -- All links found on the page
  • Media -- Images and other media references

Configuration

| Variable | Description | Default |
|----------|-------------|---------|
| PORT | Go API server port | 8091 |
| LOG_LEVEL | Logging level | info |
| REQUEST_TIMEOUT_SECONDS | Global HTTP request timeout | 120 |
| WORKER_URL | Python crawl4ai worker URL | http://localhost:8092 |
| WORKER_TIMEOUT_SECONDS | Timeout for worker requests | 120 |
| MAX_CONCURRENT_CRAWLS | Maximum parallel crawl jobs | 3 |
| MAX_QUEUE_SIZE | Maximum pending queue size | 100 |
| MIN_WORD_COUNT | Minimum words for valid content | 10 |
| FALLBACK_ENABLED | Enable strategy chain fallback | true |
| PROXY_HOST | Residential proxy host (Level 4) | optional |
| PROXY_PORT | Residential proxy port | optional |
| PROXY_USER | Proxy username | optional |
| PROXY_PASS | Proxy password | optional |
| API_KEY | API key for authentication | optional |
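For tooling that needs to read the same variables (a test harness, for instance), a loader along these lines would apply the documented defaults. It is only a sketch: the `Config` class, the accepted boolean spellings, and the `http://` proxy URL format are assumptions, and the Go server's own parsing in config.go may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional
from urllib.parse import quote


@dataclass(frozen=True)
class Config:
    port: int
    log_level: str
    request_timeout_s: int
    worker_url: str
    worker_timeout_s: int
    max_concurrent_crawls: int
    max_queue_size: int
    min_word_count: int
    fallback_enabled: bool
    proxy_url: Optional[str]
    api_key: Optional[str]


def load_config(env: Mapping[str, str] = os.environ) -> Config:
    """Build a Config from environment variables, using the documented defaults."""

    def get_int(name: str, default: int) -> int:
        return int(env.get(name, default))

    # Assemble a proxy URL only when both host and port are set (assumed format).
    proxy_url = None
    host, port = env.get("PROXY_HOST"), env.get("PROXY_PORT")
    if host and port:
        user, password = env.get("PROXY_USER"), env.get("PROXY_PASS")
        auth = ""
        if user and password:
            auth = f"{quote(user, safe='')}:{quote(password, safe='')}@"
        proxy_url = f"http://{auth}{host}:{port}"

    return Config(
        port=get_int("PORT", 8091),
        log_level=env.get("LOG_LEVEL", "info"),
        request_timeout_s=get_int("REQUEST_TIMEOUT_SECONDS", 120),
        worker_url=env.get("WORKER_URL", "http://localhost:8092"),
        worker_timeout_s=get_int("WORKER_TIMEOUT_SECONDS", 120),
        max_concurrent_crawls=get_int("MAX_CONCURRENT_CRAWLS", 3),
        max_queue_size=get_int("MAX_QUEUE_SIZE", 100),
        min_word_count=get_int("MIN_WORD_COUNT", 10),
        fallback_enabled=env.get("FALLBACK_ENABLED", "true").strip().lower()
        in ("1", "true", "yes"),
        proxy_url=proxy_url,
        api_key=env.get("API_KEY") or None,
    )
```

Calling `load_config({})` yields the defaults, which makes it easy to see which values a deployment actually overrides.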

API Endpoints

| Method | Path | Description |
|--------|------|-------------|
| POST | /v1/crawl | Submit a URL for crawling |
| GET | /v1/crawl/{job_id} | Get job status and result |
| GET | /v1/health | Service health check |

Crawl Request

{
  "url": "https://reliefweb.int/report/example",
  "sync": false,
  "strategy_level": 1
}

Crawl Response (completed)

{
  "job_id": "abc123",
  "url": "https://reliefweb.int/report/example",
  "status": "completed",
  "result": {
    "markdown": "# Article Title\n\nContent here...",
    "html": "<h1>Article Title</h1><p>Content here...</p>",
    "metadata": {
      "title": "Article Title",
      "description": "Brief summary",
      "author": "Author Name",
      "published_date": "2026-03-28"
    },
    "links": ["https://example.com/related"],
    "media": ["https://example.com/image.jpg"],
    "word_count": 1542
  },
  "strategy_used": "basic",
  "duration_ms": 4200
}
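On the client side, a response like the one above can be decoded into typed objects. The following is a minimal sketch: the field names mirror the example JSON, but the class names and the handling of a missing or null `result` (as might occur for jobs that have not completed) are assumptions.

```python
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CrawlResult:
    markdown: str = ""
    html: str = ""
    metadata: dict = field(default_factory=dict)
    links: list = field(default_factory=list)
    media: list = field(default_factory=list)
    word_count: int = 0


@dataclass
class CrawlResponse:
    job_id: str
    url: str
    status: str
    result: Optional[CrawlResult] = None
    strategy_used: Optional[str] = None
    duration_ms: Optional[int] = None


def parse_crawl_response(body: str) -> CrawlResponse:
    """Decode a crawl response body; unknown result fields are ignored."""
    data = json.loads(body)
    raw = data.get("result")
    result = None
    if isinstance(raw, dict):
        known = {k: raw[k] for k in CrawlResult.__dataclass_fields__ if k in raw}
        result = CrawlResult(**known)
    return CrawlResponse(
        job_id=data["job_id"],
        url=data["url"],
        status=data["status"],
        result=result,
        strategy_used=data.get("strategy_used"),
        duration_ms=data.get("duration_ms"),
    )
```

Ignoring unknown keys keeps the parser tolerant if the server adds fields to the result later.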

Directory Structure

tg-web-crawler-api/
├── cmd/
│   └── server/
│       └── main.go             # Entry point
├── internal/
│   ├── api/
│   │   ├── handler.go          # HTTP handlers (submit, get job, health)
│   │   ├── middleware.go       # Auth, CORS, logging
│   │   └── response.go         # Response helpers
│   ├── config/
│   │   └── config.go           # Environment configuration
│   ├── crawler/
│   │   └── router.go           # Crawl orchestration with strategy fallback
│   ├── jobstore/
│   │   └── store.go            # Thread-safe in-memory job store
│   ├── model/
│   │   └── model.go            # CrawlRequest, CrawlResponse, Job
│   └── strategy/
│       ├── strategy.go         # Strategy interface
│       ├── basic.go            # Level 1: Basic (30s)
│       ├── patient.go          # Level 2: Patient (60s)
│       ├── undetected.go       # Level 3: Undetected (90s, Xvfb)
│       └── proxy.go            # Level 4: Proxy (90s, residential)
├── worker/                     # Python crawl4ai worker
│   ├── main.py                 # FastAPI worker entry point
│   └── requirements.txt
├── pkg/
│   └── crawl4ai/
│       └── client.go           # Go client for the Python worker
├── scripts/
├── docker-compose.yml          # Go server + Python worker
├── Makefile
└── go.mod

Running

# Build and run the Go server
make run

# Run with Docker (includes Python worker)
make docker-up

# Run tests
make test

# Shut down Docker services
make docker-down
