Web Crawler
Go API server with Python crawl4ai worker featuring a 4-level strategy fallback chain for reliable web content extraction from news sites and official reports.
Overview
The tg-web-crawler-api is a two-process service that extracts web page content for the TerraGuard knowledge base. A Go API server (port 4003) accepts crawl requests and manages an async job queue, while a Python worker (port 4004) performs the actual browser-based crawling using the crawl4ai SDK.
When a URL is submitted, the crawler attempts extraction through a 4-level strategy chain, escalating from fast headless browsing to anti-detection techniques with residential proxies. This improves extraction reliability, including on sites with aggressive bot protection.
Architecture
Key Features
4-Level Strategy Chain
Each crawl attempt starts at Level 1 and escalates on failure. The Python worker implements the actual crawling logic for each level:
| Level | Strategy | Timeout | Description |
|---|---|---|---|
| 1 | Basic | 30s | Fast headless Chrome with stealth settings. Works for most standard news sites. |
| 2 | Patient | 60s | Waits for network idle and scans the full page. Handles JavaScript-heavy SPAs and lazy-loaded content. Adds a 3-second delay. |
| 3 | Undetected | 90s | Runs Chrome with headless=False behind Xvfb (virtual display). Uses the UndetectedAdapter with magic mode to bypass bot detection. |
| 4 | Proxy | 90s | Level 3 configuration plus a residential proxy. Last resort for sites that block data center IPs. |
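The escalation described in the table can be sketched as a simple loop. This is a minimal illustration, not the project's implementation: the level names and timeouts come from the table, while `Strategy`, `crawl_with_fallback`, and the `attempt` callable are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Strategy:
    level: int
    name: str
    timeout_s: int


# Levels and timeouts as listed in the strategy table.
CHAIN: List[Strategy] = [
    Strategy(1, "basic", 30),
    Strategy(2, "patient", 60),
    Strategy(3, "undetected", 90),
    Strategy(4, "proxy", 90),
]


def crawl_with_fallback(
    url: str,
    attempt: Callable[[str, Strategy], Optional[str]],
    start_level: int = 1,
    fallback_enabled: bool = True,
) -> Tuple[Optional[str], Optional[Strategy]]:
    """Try each strategy from start_level upward and return the first success.

    `attempt` is a hypothetical callable: it returns the extracted text,
    returns None (or an empty string) when nothing usable was extracted,
    or raises on error. Returns (None, None) when every level fails.
    """
    for strategy in CHAIN:
        if strategy.level < start_level:
            continue
        try:
            content = attempt(url, strategy)
        except Exception:
            # Any error counts as a failed attempt at this level.
            content = None
        if content:
            return content, strategy
        if not fallback_enabled:
            # Without fallback, only the starting level is tried.
            break
    return None, None
```

Passing the per-level crawl as a callable keeps the escalation logic separate from the browser code, so the chain can be exercised with a stub before a real browser is involved.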
Async Job Queue
The Go server maintains a bounded-concurrency job queue:
- `MAX_CONCURRENT_CRAWLS` controls how many crawls run in parallel (default: 3)
- `MAX_QUEUE_SIZE` limits the pending queue to prevent memory exhaustion (default: 100)
- Each job has a unique ID for status polling
- Results are stored in an in-memory thread-safe map
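The real queue lives in the Go server. The sketch below only illustrates the same bounded-concurrency idea in Python with `asyncio`, under stated assumptions: the `JobQueue` class, its method names, and the `queued` and `running` status values are hypothetical (the source names only `completed` and `failed`).

```python
import asyncio
import uuid
from typing import Awaitable, Callable, Dict


class JobQueue:
    """Illustrative bounded-concurrency job queue with an in-memory result map."""

    def __init__(self, max_concurrent: int = 3, max_queue_size: int = 100):
        self._slots = asyncio.Semaphore(max_concurrent)  # parallel crawl limit
        self._max_queue_size = max_queue_size
        self._pending = 0  # jobs accepted but not yet running
        self._tasks = set()
        self.jobs: Dict[str, dict] = {}

    def submit(self, url: str, crawl: Callable[[str], Awaitable[str]]) -> str:
        """Accept a job and return its ID, or raise if the pending queue is full."""
        if self._pending >= self._max_queue_size:
            raise RuntimeError("queue full")
        job_id = uuid.uuid4().hex
        self.jobs[job_id] = {"url": url, "status": "queued", "result": None}
        self._pending += 1
        task = asyncio.create_task(self._run(job_id, url, crawl))
        self._tasks.add(task)  # keep a reference so the task is not collected
        task.add_done_callback(self._tasks.discard)
        return job_id

    async def _run(self, job_id: str, url: str, crawl) -> None:
        async with self._slots:  # waits here while all slots are busy
            self._pending -= 1
            job = self.jobs[job_id]
            job["status"] = "running"
            try:
                job["result"] = await crawl(url)
                job["status"] = "completed"
            except Exception as exc:
                job["result"] = str(exc)
                job["status"] = "failed"

    async def drain(self) -> None:
        """Wait for every job submitted so far to finish."""
        await asyncio.gather(*list(self._tasks))
```

Rejecting new submissions once the pending count reaches the limit is what keeps memory bounded; the semaphore independently caps how many crawls execute at once.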
Sync and Async Modes
The crawler supports two request modes:
- Async (default) -- `POST /v1/crawl` returns immediately with a `job_id`. The client polls `GET /v1/crawl/{job_id}` until the status changes to `completed` or `failed`.
- Sync -- The request blocks until the crawl completes or times out, returning the result directly.
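For the async mode, a client-side polling loop might look like the sketch below. The function name `wait_for_job`, the polling interval, and the attempt limit are assumptions for illustration; `get_status` stands in for whatever performs `GET /v1/crawl/{job_id}` and returns the decoded JSON body.

```python
import time
from typing import Callable


def wait_for_job(
    get_status: Callable[[str], dict],
    job_id: str,
    interval_s: float = 1.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the job reports `completed` or `failed`, then return it.

    Any other status value is treated as "still in progress".
    Raises TimeoutError after max_polls unsuccessful polls.
    """
    for _ in range(max_polls):
        job = get_status(job_id)
        if job.get("status") in ("completed", "failed"):
            return job
        sleep(interval_s)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

Injecting `get_status` and `sleep` keeps the loop independent of any HTTP library and makes it easy to test without a running server.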
Content Validation
The crawler validates extracted content before accepting it:
- Minimum word count (`MIN_WORD_COUNT`, default 10) -- Rejects pages that yield too little text, which usually indicates a bot-blocked page or paywall.
- Content type validation ensures only text/HTML content is processed.
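The two checks above could be combined roughly as follows. This is a sketch under assumptions: `validate_content` is a hypothetical name, words are counted by whitespace splitting, and "text/HTML" is read as the `text/html` media type, which may be stricter than what the service actually accepts.

```python
from typing import Tuple


def validate_content(text: str, content_type: str, min_word_count: int = 10) -> Tuple[bool, str]:
    """Return (is_valid, reason) for extracted page content."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type != "text/html":  # assumption: only text/html is accepted
        return False, f"unsupported content type: {media_type or 'unknown'}"

    word_count = len(text.split())  # assumption: whitespace-delimited words
    if word_count < min_word_count:
        return False, f"too few words: {word_count} < {min_word_count}"
    return True, "ok"
```

A rejection here is a natural trigger for escalating to the next strategy level, since a near-empty page often means the request was blocked rather than that the page is genuinely short.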
Rich Output
Successful crawls return:
- Markdown -- Clean markdown representation of the page content
- HTML -- Raw HTML of the main content area
- Metadata -- Page title, description, author, publish date
- Links -- All links found on the page
- Media -- Images and other media references
Configuration
| Variable | Description | Default |
|---|---|---|
| `PORT` | Go API server port | 8091 |
| `LOG_LEVEL` | Logging level | info |
| `REQUEST_TIMEOUT_SECONDS` | Global HTTP request timeout | 120 |
| `WORKER_URL` | Python crawl4ai worker URL | http://localhost:8092 |
| `WORKER_TIMEOUT_SECONDS` | Timeout for worker requests | 120 |
| `MAX_CONCURRENT_CRAWLS` | Maximum parallel crawl jobs | 3 |
| `MAX_QUEUE_SIZE` | Maximum pending queue size | 100 |
| `MIN_WORD_COUNT` | Minimum words for valid content | 10 |
| `FALLBACK_ENABLED` | Enable strategy chain fallback | true |
| `PROXY_HOST` | Residential proxy host (Level 4) | optional |
| `PROXY_PORT` | Residential proxy port | optional |
| `PROXY_USER` | Proxy username | optional |
| `PROXY_PASS` | Proxy password | optional |
| `API_KEY` | API key for authentication | optional |
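The server reads these variables in Go (`internal/config/config.go`). As an illustration of the same defaults, a Python loader could look like the sketch below. `CrawlerConfig`, `load_config`, and the accepted boolean spellings are assumptions, not the project's code.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def _as_bool(value: str) -> bool:
    # Assumption: these spellings count as true; the real parser may differ.
    return value.strip().lower() in ("1", "true", "yes", "on")


@dataclass(frozen=True)
class CrawlerConfig:
    port: int
    log_level: str
    request_timeout_s: int
    worker_url: str
    worker_timeout_s: int
    max_concurrent_crawls: int
    max_queue_size: int
    min_word_count: int
    fallback_enabled: bool
    proxy_host: Optional[str]
    proxy_port: Optional[int]
    proxy_user: Optional[str]
    proxy_pass: Optional[str]
    api_key: Optional[str]


def load_config(env: Optional[Mapping[str, str]] = None) -> CrawlerConfig:
    """Build a config from environment variables, using the documented defaults."""
    env = os.environ if env is None else env
    proxy_port = env.get("PROXY_PORT")
    return CrawlerConfig(
        port=int(env.get("PORT", "8091")),
        log_level=env.get("LOG_LEVEL", "info"),
        request_timeout_s=int(env.get("REQUEST_TIMEOUT_SECONDS", "120")),
        worker_url=env.get("WORKER_URL", "http://localhost:8092"),
        worker_timeout_s=int(env.get("WORKER_TIMEOUT_SECONDS", "120")),
        max_concurrent_crawls=int(env.get("MAX_CONCURRENT_CRAWLS", "3")),
        max_queue_size=int(env.get("MAX_QUEUE_SIZE", "100")),
        min_word_count=int(env.get("MIN_WORD_COUNT", "10")),
        fallback_enabled=_as_bool(env.get("FALLBACK_ENABLED", "true")),
        # Optional values stay None when unset.
        proxy_host=env.get("PROXY_HOST"),
        proxy_port=int(proxy_port) if proxy_port else None,
        proxy_user=env.get("PROXY_USER"),
        proxy_pass=env.get("PROXY_PASS"),
        api_key=env.get("API_KEY"),
    )
```

Calling `load_config()` with no argument reads `os.environ`; passing a plain dict is convenient for tests.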
API Endpoints
| Method | Path | Description |
|---|---|---|
| POST | /v1/crawl | Submit a URL for crawling |
| GET | /v1/crawl/{job_id} | Get job status and result |
| GET | /v1/health | Service health check |
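A client might build the submit call with the standard library as sketched below. The helper name `build_crawl_request` and the base URL are assumptions; the body fields are the ones in the Crawl Request example. How `API_KEY` is transmitted is not documented here, so no authentication header is set.

```python
import json
import urllib.request


def build_crawl_request(
    base_url: str,
    url: str,
    sync: bool = False,
    strategy_level: int = 1,
) -> urllib.request.Request:
    """Build (but do not send) a POST /v1/crawl request."""
    body = json.dumps(
        {"url": url, "sync": sync, "strategy_level": strategy_level}
    ).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/crawl",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

The returned object can be sent with `urllib.request.urlopen(request)`, and the response body decoded with `json.load`.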
Crawl Request
{
  "url": "https://reliefweb.int/report/example",
  "sync": false,
  "strategy_level": 1
}
Crawl Response (completed)
{
  "job_id": "abc123",
  "url": "https://reliefweb.int/report/example",
  "status": "completed",
  "result": {
    "markdown": "# Article Title\n\nContent here...",
    "html": "<h1>Article Title</h1><p>Content here...</p>",
    "metadata": {
      "title": "Article Title",
      "description": "Brief summary",
      "author": "Author Name",
      "published_date": "2026-03-28"
    },
    "links": ["https://example.com/related"],
    "media": ["https://example.com/image.jpg"],
    "word_count": 1542
  },
  "strategy_used": "basic",
  "duration_ms": 4200
}
Directory Structure
tg-web-crawler-api/
├── cmd/
│   └── server/
│       └── main.go              # Entry point
├── internal/
│   ├── api/
│   │   ├── handler.go           # HTTP handlers (submit, get job, health)
│   │   ├── middleware.go        # Auth, CORS, logging
│   │   └── response.go          # Response helpers
│   ├── config/
│   │   └── config.go            # Environment configuration
│   ├── crawler/
│   │   └── router.go            # Crawl orchestration with strategy fallback
│   ├── jobstore/
│   │   └── store.go             # Thread-safe in-memory job store
│   ├── model/
│   │   └── model.go             # CrawlRequest, CrawlResponse, Job
│   └── strategy/
│       ├── strategy.go          # Strategy interface
│       ├── basic.go             # Level 1: Basic (30s)
│       ├── patient.go           # Level 2: Patient (60s)
│       ├── undetected.go        # Level 3: Undetected (90s, Xvfb)
│       └── proxy.go             # Level 4: Proxy (90s, residential)
├── worker/                      # Python crawl4ai worker
│   ├── main.py                  # FastAPI worker entry point
│   └── requirements.txt
├── pkg/
│   └── crawl4ai/
│       └── client.go            # Go client for the Python worker
├── scripts/
├── docker-compose.yml           # Go server + Python worker
├── Makefile
└── go.mod
Running
# Build and run the Go server
make run
# Run with Docker (includes Python worker)
make docker-up
# Run tests
make test
# Shut down Docker services
make docker-down