Web Crawler

A Go API server with a Python crawl4ai worker, featuring a 4-level strategy fallback chain for reliable web content extraction from news sites and official reports.

Overview

The tg-web-crawler-api is a two-process service that extracts web page content for the TerraGuard knowledge base. A Go API server (port 4003) accepts crawl requests and manages an async job queue, while a Python worker (port 4004) performs the actual browser-based crawling using the crawl4ai SDK.

When a URL is submitted, the crawler attempts extraction through a 4-level strategy chain, moving to the next level only when the current one fails. The chain escalates from fast headless browsing to anti-detection techniques and, as a last resort, a residential proxy, so that content can be extracted even from sites with aggressive bot protection.

Level	Strategy	Timeout	Description
1	Basic	30s	Fast headless Chrome with stealth settings. Works for most standard news sites.
2	Patient	60s	Waits for network idle and scans the full page. Handles JavaScript-heavy SPAs and lazy-loaded content. Adds a 3-second delay.
3	Undetected	90s	Runs Chrome with `headless=False` behind Xvfb (virtual display). Uses the `UndetectedAdapter` with magic mode to bypass bot detection.
4	Proxy	90s	Level 3 configuration plus a residential proxy. Last resort for sites that block data center IPs.
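
The escalation in the table above can be sketched as a loop over the levels. This is a minimal illustration in Python, not the service's actual Go implementation: the `fetch` callable, the `crawl_with_fallback` name, and the lowercase strategy names other than `basic` are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Strategy:
    level: int
    name: str
    timeout_s: int


# Levels and timeouts are taken from the table; the names are assumed
# lowercase forms (only "basic" appears in the sample response).
STRATEGIES = [
    Strategy(1, "basic", 30),
    Strategy(2, "patient", 60),
    Strategy(3, "undetected", 90),
    Strategy(4, "proxy", 90),
]

# Hypothetical signature: returns extracted content, or None on failure.
Fetcher = Callable[[str, Strategy], Optional[str]]


def crawl_with_fallback(url, fetch, start_level=1, fallback_enabled=True):
    """Try each strategy from start_level upward.

    Returns (strategy_name, content), or (None, None) if every attempt fails.
    """
    for strategy in STRATEGIES:
        if strategy.level < start_level:
            continue
        try:
            content = fetch(url, strategy)
        except Exception:
            content = None  # an error or timeout counts as a failed attempt
        if content:
            return strategy.name, content
        if not fallback_enabled:
            break  # without fallback, only the starting level is tried
    return None, None
```

Passing the fetcher in as a parameter keeps the escalation logic separate from the browser work, which makes the chain easy to exercise with a fake fetcher.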

Async Job Queue

The Go server maintains a bounded-concurrency job queue:

`MAX_CONCURRENT_CRAWLS` controls how many crawls run in parallel (default: 3)
`MAX_QUEUE_SIZE` limits the pending queue to prevent memory exhaustion (default: 100)
Each job has a unique ID for status polling
Results are stored in an in-memory thread-safe map, so they are lost when the server restarts
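
The queue behaviour listed above can be sketched with a bounded queue, a fixed pool of worker threads, and a lock-protected dictionary. This is a Python illustration of the idea only; the real queue lives in the Go server. The `JobQueue` class, the `run_crawl` callable, and the `pending` and `running` status names are assumptions (the document names only `completed` and `failed`).

```python
import queue
import threading
import uuid


class JobQueue:
    """Bounded pending queue plus a fixed number of worker threads."""

    def __init__(self, run_crawl, max_concurrent=3, max_queue_size=100):
        self._run_crawl = run_crawl  # hypothetical callable: url -> result
        self._pending = queue.Queue(maxsize=max_queue_size)
        self._jobs = {}
        self._lock = threading.Lock()
        for _ in range(max_concurrent):
            threading.Thread(target=self._worker, daemon=True).start()

    def submit(self, url):
        """Enqueue a crawl; return its job ID, or None if the queue is full."""
        job_id = uuid.uuid4().hex
        with self._lock:
            self._jobs[job_id] = {"url": url, "status": "pending", "result": None}
        try:
            self._pending.put_nowait(job_id)
        except queue.Full:
            with self._lock:
                del self._jobs[job_id]
            return None
        return job_id

    def get(self, job_id):
        """Return a snapshot of the job, or None if the ID is unknown."""
        with self._lock:
            job = self._jobs.get(job_id)
            return dict(job) if job else None

    def wait_idle(self):
        """Block until every queued job has been processed."""
        self._pending.join()

    def _worker(self):
        while True:
            job_id = self._pending.get()
            with self._lock:
                url = self._jobs[job_id]["url"]
                self._jobs[job_id]["status"] = "running"
            try:
                result, status = self._run_crawl(url), "completed"
            except Exception as exc:
                result, status = str(exc), "failed"
            with self._lock:
                self._jobs[job_id].update(status=status, result=result)
            self._pending.task_done()
```

In this sketch a full queue makes `submit` return `None` instead of blocking, so the caller can reject the request immediately; how the real server reports a full queue is not stated in this document.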

Sync and Async Modes

The crawler supports two request modes:

Async (default) -- `POST /v1/crawl` returns immediately with a `job_id`. The client polls `GET /v1/crawl/{job_id}` until the status changes to `completed` or `failed`.
Sync -- With `"sync": true` in the request body, the request blocks until the crawl completes or times out and returns the result directly.
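
On the client side, the async mode amounts to a polling loop. The sketch below assumes a caller-supplied `get_status` function that returns the parsed JSON body of `GET /v1/crawl/{job_id}`; the function names, the polling interval, and the client-side timeout are illustrative choices, not part of the API.

```python
import json
import time
import urllib.request


def http_status_getter(base_url):
    """Return a get_status function backed by GET {base_url}/v1/crawl/{job_id}."""
    def get_status(job_id):
        with urllib.request.urlopen(f"{base_url}/v1/crawl/{job_id}") as resp:
            return json.load(resp)
    return get_status


def wait_for_job(get_status, job_id, interval_s=1.0, timeout_s=120.0, sleep=time.sleep):
    """Poll until the job is completed or failed, then return the final body."""
    deadline = time.monotonic() + timeout_s
    while True:
        body = get_status(job_id)
        if body.get("status") in ("completed", "failed"):
            return body
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} not finished after {timeout_s}s")
        sleep(interval_s)
```

A call might look like `wait_for_job(http_status_getter(base_url), job_id)`, where `base_url` is wherever the Go server is listening. If `API_KEY` is set, the request would also need to carry the key; how it is passed is not specified here.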

Content Validation

The crawler validates extracted content before accepting it:

Minimum word count (`MIN_WORD_COUNT`, default 10) -- Rejects pages that yield too little text, which usually indicates a bot-blocked page or a paywall.
Content type -- Only text/HTML content is processed.
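
The two checks above could look roughly like this. It is a sketch under assumptions: the exact set of accepted media types, whitespace-based word counting, and the `validate_content` name are all illustrative, since the document does not specify them.

```python
# Assumption: "text/HTML" is read here as these two media types.
ALLOWED_MEDIA_TYPES = {"text/html", "text/plain"}


def validate_content(text, content_type, min_word_count=10):
    """Return (ok, reason) for extracted text and its Content-Type header."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type not in ALLOWED_MEDIA_TYPES:
        return False, f"unsupported content type: {media_type}"
    # Assumption: words are whitespace-separated tokens.
    word_count = len(text.split())
    if word_count < min_word_count:
        return False, f"too few words: {word_count} < {min_word_count}"
    return True, "ok"
```

A short block page such as "Access denied" fails the word-count check, which is the kind of result that would trigger the next strategy level.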

Rich Output

Successful crawls return:

Markdown -- Clean markdown representation of the page content
HTML -- Raw HTML of the main content area
Metadata -- Page title, description, author, publish date
Links -- All links found on the page
Media -- Images and other media references

Configuration

Variable	Description	Default
`PORT`	Go API server port	`8091`
`LOG_LEVEL`	Logging level	`info`
`REQUEST_TIMEOUT_SECONDS`	Global HTTP request timeout	`120`
`WORKER_URL`	Python crawl4ai worker URL	`http://localhost:8092`
`WORKER_TIMEOUT_SECONDS`	Timeout for worker requests	`120`
`MAX_CONCURRENT_CRAWLS`	Maximum parallel crawl jobs	`3`
`MAX_QUEUE_SIZE`	Maximum pending queue size	`100`
`MIN_WORD_COUNT`	Minimum words for valid content	`10`
`FALLBACK_ENABLED`	Enable strategy chain fallback	`true`
`PROXY_HOST`	Residential proxy host (Level 4)	optional
`PROXY_PORT`	Residential proxy port	optional
`PROXY_USER`	Proxy username	optional
`PROXY_PASS`	Proxy password	optional
`API_KEY`	API key for authentication	optional
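
To show how the defaults in the table apply, here is a small Python sketch that reads the variables from an environment mapping. The service's own loader is Go code in `internal/config/config.go`; the `Config` dataclass, `load_config`, and the accepted boolean spellings below are assumptions for illustration.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Config:
    port: int
    log_level: str
    request_timeout_s: int
    worker_url: str
    worker_timeout_s: int
    max_concurrent_crawls: int
    max_queue_size: int
    min_word_count: int
    fallback_enabled: bool
    proxy_host: Optional[str]
    proxy_port: Optional[int]
    proxy_user: Optional[str]
    proxy_pass: Optional[str]
    api_key: Optional[str]


def _as_bool(value):
    # Assumption: these spellings count as true; everything else is false.
    return value.strip().lower() in ("1", "true", "yes", "on")


def load_config(env=None):
    """Build a Config from an environment mapping, using the table's defaults."""
    env = os.environ if env is None else env
    return Config(
        port=int(env.get("PORT", "8091")),
        log_level=env.get("LOG_LEVEL", "info"),
        request_timeout_s=int(env.get("REQUEST_TIMEOUT_SECONDS", "120")),
        worker_url=env.get("WORKER_URL", "http://localhost:8092"),
        worker_timeout_s=int(env.get("WORKER_TIMEOUT_SECONDS", "120")),
        max_concurrent_crawls=int(env.get("MAX_CONCURRENT_CRAWLS", "3")),
        max_queue_size=int(env.get("MAX_QUEUE_SIZE", "100")),
        min_word_count=int(env.get("MIN_WORD_COUNT", "10")),
        fallback_enabled=_as_bool(env.get("FALLBACK_ENABLED", "true")),
        # Optional values: unset or empty becomes None.
        proxy_host=env.get("PROXY_HOST") or None,
        proxy_port=int(env["PROXY_PORT"]) if env.get("PROXY_PORT") else None,
        proxy_user=env.get("PROXY_USER") or None,
        proxy_pass=env.get("PROXY_PASS") or None,
        api_key=env.get("API_KEY") or None,
    )
```

Accepting the mapping as a parameter means the defaults can be checked with a plain dictionary instead of mutating the process environment.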

API Endpoints

Method	Path	Description
`POST`	`/v1/crawl`	Submit a URL for crawling
`GET`	`/v1/crawl/{job_id}`	Get job status and result
`GET`	`/v1/health`	Service health check

Crawl Request

{
  "url": "https://reliefweb.int/report/example",
  "sync": false,
  "strategy_level": 1
}
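
As a usage sketch, the request body above can be built with the Python standard library. The `build_crawl_request` helper and its `base_url` parameter are hypothetical; the function only constructs the request and does not send it.

```python
import json
import urllib.request


def build_crawl_request(base_url, url, sync=False, strategy_level=1):
    """Build (but do not send) a POST /v1/crawl request with a JSON body."""
    payload = {"url": url, "sync": sync, "strategy_level": strategy_level}
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/crawl",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would submit the job. Any authentication header required when `API_KEY` is set is not covered by this sketch.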

Crawl Response (completed)

{
  "job_id": "abc123",
  "url": "https://reliefweb.int/report/example",
  "status": "completed",
  "result": {
    "markdown": "# Article Title\n\nContent here...",
    "html": "<h1>Article Title</h1><p>Content here...</p>",
    "metadata": {
      "title": "Article Title",
      "description": "Brief summary",
      "author": "Author Name",
      "published_date": "2026-03-28"
    },
    "links": ["https://example.com/related"],
    "media": ["https://example.com/image.jpg"],
    "word_count": 1542
  },
  "strategy_used": "basic",
  "duration_ms": 4200
}
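
A client might unpack a response like the one above as follows. The `CrawlResult` dataclass and `parse_crawl_response` function are illustrative names, and treating everything except `markdown` and `html` as optional is an assumption.

```python
import json
from dataclasses import dataclass


@dataclass
class CrawlResult:
    markdown: str
    html: str
    metadata: dict
    links: list
    media: list
    word_count: int


def parse_crawl_response(raw):
    """Return (status, CrawlResult or None) from a JSON response string."""
    body = json.loads(raw)
    status = body.get("status")
    if status != "completed":
        return status, None  # no result to unpack yet, or the job failed
    result = body["result"]
    return status, CrawlResult(
        markdown=result["markdown"],
        html=result["html"],
        metadata=result.get("metadata", {}),
        links=result.get("links", []),
        media=result.get("media", []),
        word_count=result.get("word_count", 0),
    )
```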

Directory Structure

tg-web-crawler-api/
├── cmd/
│   └── server/
│       └── main.go             # Entry point
├── internal/
│   ├── api/
│   │   ├── handler.go          # HTTP handlers (submit, get job, health)
│   │   ├── middleware.go       # Auth, CORS, logging
│   │   └── response.go         # Response helpers
│   ├── config/
│   │   └── config.go           # Environment configuration
│   ├── crawler/
│   │   └── router.go           # Crawl orchestration with strategy fallback
│   ├── jobstore/
│   │   └── store.go            # Thread-safe in-memory job store
│   ├── model/
│   │   └── model.go            # CrawlRequest, CrawlResponse, Job
│   └── strategy/
│       ├── strategy.go         # Strategy interface
│       ├── basic.go            # Level 1: Basic (30s)
│       ├── patient.go          # Level 2: Patient (60s)
│       ├── undetected.go       # Level 3: Undetected (90s, Xvfb)
│       └── proxy.go            # Level 4: Proxy (90s, residential)
├── worker/                     # Python crawl4ai worker
│   ├── main.py                 # FastAPI worker entry point
│   └── requirements.txt
├── pkg/
│   └── crawl4ai/
│       └── client.go           # Go client for the Python worker
├── scripts/
├── docker-compose.yml          # Go server + Python worker
├── Makefile
└── go.mod

Running

# Build and run the Go server
make run

# Run with Docker (includes Python worker)
make docker-up

# Run tests
make test

# Shut down Docker services
make docker-down