Introduction
The crawlcrawl API is REST. JSON in, JSON out. Bearer-token auth. Endpoints are versioned (/v1) and we won't break the wire format inside a major version.
Base URL:
https://api.crawlcrawl.com/v1. Public hostname with Let's Encrypt TLS — standard cert verification works, no client tweaks required.
Quickstart
The shortest possible exchange: start a crawl, poll until done, fetch a page.
```bash
# export your key (sign up at dashboard.crawlcrawl.com/signup for 1,500 free/month)
export CRAWLCRAWL_KEY="crk_..."

# 1. start a crawl
curl -X POST https://api.crawlcrawl.com/v1/crawls \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "url": "https://example.com", "max_pages": 50 }'
# → 202 {"id":43,"status":"queued","url":"https://example.com"}

# 2. poll the run
curl https://api.crawlcrawl.com/v1/crawls/43 \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"

# 3. when status=done, list pages and fetch one as markdown
curl "https://api.crawlcrawl.com/v1/crawls/43/pages" \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"
curl "https://api.crawlcrawl.com/v1/pages/195?format=markdown" \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"
```
Authentication
Every request needs Authorization: Bearer crk_.... Keys are scoped per project; we store only the SHA-256 hash and show plaintext exactly once at mint time. Never put a key in browser code.
To mint a key: create a free project at dashboard.crawlcrawl.com/signup. Rotate via POST /v1/keys/rotate. Revocation is instant — keys are server-side only.
TLS
HTTPS only — auto-provisioned via Let's Encrypt on api.crawlcrawl.com. No special client config required; standard cert verification works.
Limits
Limits are per-project and configurable per key. The defaults below match what new keys ship with.
| Limit | Default | Notes |
|---|---|---|
| Concurrent runs | 5 | Submitting a 6th while 5 are queued/running returns 429. |
| max_pages per run | 100,000 | Hard cap; the request is rejected at validation. |
| depth | 1..50 | Server-side validation — outside this range returns 400. |
| concurrency per crawl | unset (auto) | Sustained throughput on one LXC is ~3 pages/s; bumping concurrency above ~16 mostly queues at the worker pool. |
| Sustained system throughput | ~3 pages/s | One LXC, public IP. Workload budget today: 100k–200k pages/day across all projects. |
Errors
HTTP status codes are honest. Bodies are JSON in this envelope:
{ "error": { "code": "invalid_input", "message": "depth must be 1..=50" } }
| Code | HTTP | Meaning |
|---|---|---|
| invalid_input | 400 | Body failed validation (bad URL, depth out of range, etc.). |
| unauthorized | 401 | Missing or wrong bearer token. |
| not_found | 404 | Resource doesn't exist, or belongs to a different project. |
| too_many_requests | 429 | You hit your concurrent-runs cap. Wait for one to finish. |
| internal | 500 | Something on our side. Will be in our logs; ping us. |
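In scripts, the envelope makes failures easy to branch on. A minimal sketch, assuming jq is installed and using a deliberately out-of-range depth to trigger the 400:

```bash
# Trigger a validation error on purpose (depth outside 1..=50) and read the envelope.
resp=$(curl -s -w '\n%{http_code}' -X POST https://api.crawlcrawl.com/v1/crawls \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "url": "https://example.com", "depth": 99 }')
status=$(echo "$resp" | tail -n1)   # last line is the HTTP status code
body=$(echo "$resp" | sed '$d')     # everything above it is the JSON body

if [ "$status" -ge 400 ]; then
  echo "failed ($status): $(echo "$body" | jq -r '.error.code + ": " + .error.message')"
fi
```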
POST /v1/crawls — start a crawl
Enqueues a crawl. Returns 202 immediately — the worker picks it up from the Postgres queue and starts within seconds.
Body — top-level fields
| Field | Type | Description |
|---|---|---|
| url | string | Required. Must start with http:// or https://. Host is checked against the malicious-domain blocklist before queueing. |
| user_agent | string | Optional. If omitted (or starts with InternalCrawler/), we rotate to a fresh real-browser UA via ua_generator. |
| proxy_url | string? | Optional. Per-request override of project default. Supports http://, https://, socks5://. |
| webhook_url | string? | Optional. POSTed once when the run reaches done or failed. See Webhooks. |
Body — crawl params (FLAT, not nested)
These fields go at the top level of the body — not under a params key. The server uses serde flatten; nested params are silently ignored.
| Field | Type | Default | Description |
|---|---|---|---|
| max_pages | int | 1000 | Hard cap; crawl stops when reached. Range 1..=100_000. |
| concurrency | int? | auto | Parallel in-flight requests. |
| depth | int | 5 | Max link-following depth from seed. Range 1..=50. |
| delay_ms | int | 250 | Delay between requests per host. |
| subdomains | bool | false | If true, follows links into subdomains of the seed host. |
| respect_robots | bool | true | Obey robots.txt. |
| store_html | bool | true | If false, we collect URL/status/bytes/metadata only — useful for cheap site maps. |
| seed_kind | string? | "url" | Either "url" (link-following) or "sitemap" (walks sitemap.xml + sitemap-index recursively). In sitemap mode you may pass either the origin (https://example.com, defaults to /sitemap.xml) or the explicit sitemap URL (https://example.com/sitemap-index.xml) — both work. Much faster than link-following on large sites. |
| headers | object? | — | String→string map attached to every request in the crawl (e.g. auth, custom referer). |
| cookies | string? | — | Set-Cookie-style string: k=v; k=v. |
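To make the flat-body rule concrete, here is a request sketch combining fields from both tables above; the values are illustrative.

```bash
# Crawl params sit at the top level of the body, next to url (not under a "params" key).
curl -X POST https://api.crawlcrawl.com/v1/crawls \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "url": "https://example.com",
        "max_pages": 500,
        "depth": 3,
        "delay_ms": 500,
        "subdomains": true,
        "store_html": false
      }'
```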
Headers
| Header | Purpose |
|---|---|
| Authorization | Bearer crk_... — required. |
| Content-Type | application/json — required. |
| Idempotency-Key | Optional but recommended. See Idempotency. |
Response — 202 Accepted
{ "id": 43, "status": "queued", "url": "https://example.com" }
GET /v1/crawls/{id} — run status
Poll every 2–5 seconds while status is queued or running. Or skip polling entirely and use a webhook.
{ "id": 43, "url": "https://aeoniti.com", "status": "running", "page_count": 27, "error_count": 0, "enqueued_at": "2026-05-08T16:48:36.151Z", "started_at": "2026-05-08T16:48:36.401Z", "finished_at": null, "error_message": null, "proxy_url": null }
status ∈ queued | running | done | failed | cancelled.
Note: page_count includes pages that returned non-200 statuses (those count as "fetched"). error_count only counts crawler-side failures: DNS, TLS, connect timeout.
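A minimal polling loop along those lines (jq assumed; run id taken from the 202 response):

```bash
# Poll until the run reaches a terminal state (done | failed | cancelled).
run_id=43
while :; do
  status=$(curl -s "https://api.crawlcrawl.com/v1/crawls/$run_id" \
    -H "Authorization: Bearer $CRAWLCRAWL_KEY" | jq -r '.status')
  echo "status: $status"
  case "$status" in
    done|failed|cancelled) break ;;
  esac
  sleep 3
done
```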
DELETE /v1/crawls/{id} — cancel and cascade
Cancels if running, then cascades — all pages rows for the run are removed. Returns 204 No Content. Subsequent GETs return 404.
GET /v1/crawls/{id}/pages — list pages
| Query | Default | Notes |
|---|---|---|
| limit | 100 | Max 1000. |
| offset | 0 | Standard offset paging. |
| status | — | Filter by HTTP status (e.g. only 200s). |
{ "items": [ { "id": 195, "url": "https://aeoniti.com/pricing", "status": 200, "bytes": 13650, "fetched_at": "2026-05-08T16:48:40.595Z" } ], "next_offset": 14 }
Save the global id per row — that's the path parameter for fetching content.
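A paging sketch that collects those ids (jq assumed; it simply advances offset until an empty page comes back rather than relying on next_offset):

```bash
# Walk the page list 1000 rows at a time and print the global page ids.
run_id=43
offset=0
while :; do
  page=$(curl -s "https://api.crawlcrawl.com/v1/crawls/$run_id/pages?limit=1000&offset=$offset" \
    -H "Authorization: Bearer $CRAWLCRAWL_KEY")
  count=$(echo "$page" | jq '.items | length')
  [ "$count" -eq 0 ] && break
  echo "$page" | jq -r '.items[].id'
  offset=$((offset + count))
done
```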
GET /v1/pages/{id} — fetch one page
id is the global page id from the list endpoint, not a 1-based index within a run.
| format | Returns |
|---|---|
| html | Raw HTML (decompressed) + metadata. Default. |
| markdown | Clean markdown via fast_html2md, link-preserving + metadata. |
| article | llm_readability boilerplate-stripped article_text + article_html + metadata. |
| both | html + markdown + article_text + article_html + metadata. |
Always returns id, url, status, bytes, fetched_at. metadata shape:
{ "title": "Pricing — AEONiti", "description": "...", "canonical": "https://aeoniti.com/pricing", "og": { "title": "...", "image": "...", "type": "website" }, "twitter": { "card": "summary_large_image" }, "json_ld": [ { "@type": "Product" } ] }
GET /v1/health · /v1/ready
No auth. Use /v1/health for liveness (process up); /v1/ready verifies Postgres reachable + worker heartbeat current.
```bash
curl https://api.crawlcrawl.com/v1/health   # → 200 "ok"
curl https://api.crawlcrawl.com/v1/ready    # → 200 "ready" or 503 + JSON
```
Webhooks
If you set webhook_url on a crawl, the worker POSTs once when the run reaches a terminal state (done or failed).
Headers
```
Content-Type: application/json
X-Crawler-Run-Id: 43
```
Body
{ "id": 43, "project_id": 2, "url": "https://aeoniti.com", "status": "done", "page_count": 14, "error_count": 0, "started_at": "...", "finished_at": "...", "error_message": null, "delivered_at": "..." }
Retry policy
- 5 attempts total.
- Exponential backoff: 1s, 2s, 4s, 8s.
- 2xx → status delivered.
- 4xx → marked dead immediately. Your endpoint rejected it; we don't retry.
- 5xx / network / timeout → retry. After attempt 5 → dead.
Delivery state persists on the run. You can poll GET /v1/crawls/{id} and read webhook_status, webhook_attempts, webhook_last_error.
No HMAC signing on these deliveries yet — until that ships, treat the webhook body as advisory: re-fetch the run via GET before acting on it. (See GET /v1/webhook/secret for the X-Crawler-Signature scheme.)
Idempotency
Pass Idempotency-Key: <your-uuid> on POST to make retries safe. The same key returns the original run (with the same id) instead of creating a duplicate. Keys are scoped per project and persist for 24 hours.
Use this when the network call to start the crawl might retry (queue handlers, lambda invocations, etc.).
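A minimal sketch: mint the key once, persist it alongside the job, and send the same value on every retry.

```bash
# Mint the key once, persist it, and reuse the same value on every retry of this request.
idem_key=$(uuidgen)
curl -X POST https://api.crawlcrawl.com/v1/crawls \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -H "Idempotency-Key: $idem_key" \
  -d '{ "url": "https://example.com", "max_pages": 50 }'
```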
Common patterns
Single-page extract (URL → markdown)
{ "url": "https://example.com/post", "max_pages": 1, "depth": 1 }
Then ?format=both on the one resulting page.
Crawl whole site, get clean text only
{ "url": "https://customer.com", "max_pages": 500, "depth": 8, "store_html": false }
Then ?format=article per page → embeddings input.
Fast site map via sitemap.xml
```jsonc
// either form works — origin defaults to /sitemap.xml,
// or pass the explicit URL (handy for /sitemap-index.xml etc.)
{
  "url": "https://customer.com/sitemap.xml",
  "seed_kind": "sitemap",
  "max_pages": 5000,
  "store_html": false
}
```
Authenticated crawl
{ "url": "https://app.customer.com", "headers": { "Authorization": "Bearer abc123" }, "cookies": "session=XYZ; csrf=ABC", "respect_robots": false }
Async with webhook (recommended for production)
{ "url": "https://customer.com", "max_pages": 200, "concurrency": 4, "webhook_url": "https://you.example.com/crawl-done" }
Known limits
| Limit | Workaround |
|---|---|
| No JS rendering in crawl runs — SPAs that need client-side render won't yield content beyond the shell | Use seed_kind: "sitemap" if the site publishes one, or render individual pages via POST /v1/cloud/render. Headless-chrome rendering inside crawls is on the roadmap. |
| Webhooks are unsigned | Re-verify by calling GET /v1/crawls/{id} from your handler. |
| Single LXC, single Postgres — no HA | Daily pg_dump backups, weekly restore-test drill. R2 off-box backup landing soon. |
| store_html=false means pages can't be re-fetched as markdown later | Decide upfront per crawl. |
| 404/403/429 from target are pages, not errors | Filter the page list by status if you only want successes. |
Roadmap
Things that aren't built yet, in rough priority order:
- POST /v1/map — domain URL discovery via sitemap + robots
- HMAC-signed webhooks
- SSE streaming for live run logs
- Off-box backup push (Cloudflare R2)
- JS rendering via headless Chrome in the crawl pipeline (feature flag)
- Schema-driven structured extraction (the /extract endpoint)
- Self-host bundle (Docker Compose + Helm chart)
POST /v1/scan — synchronous single-URL scan
Sub-second URL → markdown + signals + AI-bot policy + llms.txt. No queue, no polling.
Body
| Field | Type | Default | Description |
|---|---|---|---|
| url | string | — | Required. |
| user_agent | string? | rotated UA | Custom UA. |
| max_age_seconds | int? | — | If a stored page row for this URL exists within this many seconds, return it (cache hit). |
| cloud_mode | string? | project default | One of: none, auto, unblocker, browser. |
| metadata_only | bool | false | Skip markdown/article extraction; return only metadata + signals subset. ~5× faster. |
| only_main_content | bool | false | Run readability first to drop nav/footer/sidebar, then convert to markdown. |
| include_links | bool | false | Return full anchor list [{to, anchor, rel, nofollow}]. |
| screenshot_inline | bool | false | Include base64 PNG (counts as +1 cloud op). Surfaces screenshot_error on failure. |
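A request sketch built from the fields above (values are illustrative; the full response shape isn't reproduced here):

```bash
# Synchronous scan: main-content markdown plus the outbound link list, served from cache if fresh enough.
curl -X POST https://api.crawlcrawl.com/v1/scan \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "url": "https://example.com/post",
        "only_main_content": true,
        "include_links": true,
        "max_age_seconds": 3600
      }'
```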
POST /v1/scan/bulk — parallel multi-URL scan
{ "urls": ["https://a.com", "https://b.com", "..."], "concurrency": 8, "max_age_seconds": 3600 }
Up to 100 URLs per call. concurrency max 32. Each result is an object with {url, ok, result|error}.
GET /v1/crawls/{id}/links — link graph
Lazy-extracts anchors from stored HTML. Supports ?from_page_id=N to scope to one page, plus ?limit&offset.
GET /v1/crawls/{id}/orphans — pages with no inbound link
Returns crawled pages that no other crawled page links to. Excludes the seed.
GET /v1/crawls/{old}/diff/{new} — crawl diff
Compares two runs' pages by content_hash. Returns added, removed, changed arrays plus an unchanged count.
{ "old_run_id": 9123, "new_run_id": 9128, "summary": { "added": 2, "removed": 0, "changed": 3, "unchanged": 42 }, "changed": [ { "url": "https://x.com/pricing", "old_hash": "abc...", "new_hash": "def..." } ] }
POST /v1/cloud/scrape
{ "url": "https://protected.example", "return_format": "markdown", // markdown | raw | text | commonmark "chrome": true // optional headless render }
Direct anti-bot scrape. Returns content, cost_usd, elapsed_ms.
POST /v1/cloud/crawl
Multi-page anti-bot crawl. Body: {url, limit (max 500), return_format, chrome}.
POST /v1/cloud/search
SERP-style search. Body: {query, limit (max 50), return_format}.
POST /v1/cloud/links
Outbound link extraction from a single URL via anti-bot fetch. Returns URLs found ON the input page (forward direction). Body: {url, limit (max 5000)}. NOT a backlink intelligence tool — for inbound-link data (who links to you across the web) use Ahrefs, Semrush, or Moz; we don't operate a web-scale link graph.
POST /v1/cloud/screenshot
Returns raw PNG bytes. Body: {url}.
GET /v1/cloud/balance
Account-wide remaining cloud credit balance.
POST /v1/cloud/render
JS-rendered HTML via local headless Chrome with vendor fallback. Body: {url, force_backend?, return_format?}.
POST /v1/cloud/transform
HTML→markdown (local fast_html2md) or PDF→text (local poppler). Body: {data, input_kind, return_format}. Runs locally; no spider.cloud credit charged.
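A sketch of a local HTML→markdown conversion. The input_kind value "html" is an assumption inferred from the two conversions named above, not a documented constant:

```bash
# Convert an HTML snippet to markdown locally (no spider.cloud credit charged).
# "html" as input_kind is assumed; the reference only names HTML→markdown and PDF→text.
curl -X POST https://api.crawlcrawl.com/v1/cloud/transform \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "data": "<h1>Hello</h1><p>World</p>",
        "input_kind": "html",
        "return_format": "markdown"
      }'
```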
POST /v1/cloud/unblock
Heavyweight anti-bot via spider.cloud unblocker (residential proxies, captcha solving). Body: {url, return_format}. Higher cost-per-call than scrape/render.
POST /v1/cloud/fetch/{domain}
Call a per-domain AI-configured scraper. Returns structured payload. Path includes target domain and optional path suffix.
GET /v1/cloud/scrapers
Lists AI-configured per-domain scrapers available on the vendor account.
POST /v1/signup — self-serve signup (public)
The only endpoint that does not use a Bearer key. Public, rate-limited (3 attempts/IP/24h + 100 successful/day globally), requires a Cloudflare Turnstile token in production.
Free tier auto-mints and emails the key. Paid tiers record an intent and return a checkout_url.
GET /v1/billing/checkout
Public. Creates a Stripe Checkout session for an existing paid signup intent. Returns {checkout_url, mock, intent_id, tier}.
POST /v1/billing/webhook
Stripe-only. Verifies Stripe-Signature. On checkout.session.completed, provisions the project, marks intent fulfilled, emails the key.
GET /v1/billing/status
Auth-required. Returns tier, Stripe IDs, subscription status, and billing_mode (live or mock).
cron field — recurring monitors
Add cron to your POST /v1/crawls body to register a recurring monitor instead of a one-shot run. Standard 5-field UTC cron expression.
{ "url": "https://competitor.com/pricing", "max_pages": 1, "cron": "0 */6 * * *", // every 6 hours "webhook_url": "https://yourapp.com/changes", "webhook_events": ["crawl.diff_detected"], "return_only_changed": true } # → 201 { monitor_id, schedule, next_fires_at, url, info }
Monitor cap per tier (3 / 25 / 200). Tick fires at the closest UTC schedule match. previous_run_id is auto-linked between consecutive ticks.
webhook_events
Array of event types to deliver. Defaults to ["crawl.done"].
| Event | When it fires |
|---|---|
| crawl.done | Run reaches terminal state (done or failed). |
| crawl.diff_detected | Recurring monitor: previous run exists, diff is non-zero. Body includes diff: {added, removed, changed, unchanged} + previous_run_id. |
return_only_changed
For recurring monitors only. When true, skip webhook delivery if the diff vs previous run shows zero changes. webhook_status is set to 'suppressed_no_change'. First run in a chain (no previous) sets 'suppressed_no_baseline'.
GET /v1/crons — list active monitors
Returns each monitor's id, schedule, url, enabled, last_fired_at, next_fires_at, last_run_id.
PATCH /v1/crons/{id} — edit a monitor
Body: {schedule?, enabled?}. Pause/resume or change cadence without recreating.
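For example, pausing a monitor without deleting it (the monitor id is illustrative):

```bash
# Pause monitor 7; send {"enabled": true} later to resume, or a new "schedule" to change cadence.
curl -X PATCH https://api.crawlcrawl.com/v1/crons/7 \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "enabled": false }'
```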
DELETE /v1/crons/{id} — remove a monitor
GET /v1/usage — current period usage
Returns tier name, configured caps, today and month-to-date counters with percentages, and seconds until UTC midnight reset.
GET /v1/usage/history
Daily buckets for the last N days (max 365).
GET /v1/keys — list API keys
Hash prefix only — plaintext is never recoverable. Includes created_at, last_used_at, revoked_at, expires_at.
POST /v1/keys/rotate
{ "label": "ci-runner", "grace_seconds": 86400 // existing keys remain valid for this window } # → 200 { api_key, prefix, label, grace_seconds } # Plaintext shown ONCE.
DELETE /v1/keys/{prefix}
Refuses if it would leave the project with no active key (returns 409).
GET /v1/logs — audit feed
Last N requests for this project with method, path, status, duration_ms, client_ip, request_id.
GET /v1/webhook/secret
Returns the project's HMAC-SHA256 secret in hex. Lazy-generated on first call. Use to verify X-Crawler-Signature on incoming webhooks.
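If your deliveries carry X-Crawler-Signature, a verification sketch with openssl. Two assumptions here that the text above doesn't pin down: the signature is the hex HMAC-SHA256 of the raw request body, and the secret endpoint returns a field named secret.

```bash
# Fetch the project secret once and cache it ("secret" as the field name is an assumption).
secret=$(curl -s https://api.crawlcrawl.com/v1/webhook/secret \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" | jq -r '.secret')

# In your webhook handler: $raw_body is the unparsed request body,
# $received_sig is the value of the X-Crawler-Signature header.
expected=$(printf '%s' "$raw_body" | openssl dgst -sha256 -hmac "$secret" | awk '{print $NF}')
if [ "$expected" = "$received_sig" ]; then
  echo "signature ok"
else
  echo "signature mismatch; re-fetch the run via GET /v1/crawls/{id} before acting"
fi
```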
GET /v1/robots-policy
Returns parsed AI-bot policy + raw robots.txt + llms.txt for the host. Cached per-host (TTL 1h). Cheap.
POST /v1/llms-txt-build
Crawl a domain, return a properly-formatted llms.txt file ready to publish. Synchronous (waits for crawl up to wait_seconds; default 60, max 180).
{ "url": "https://customer.com", "max_pages": 30, "site_name": "Customer Inc", "summary": "B2B widget company", "wait_seconds": 60 } # → 200 text/plain (the llms.txt content) # → 408 Request Timeout if crawl exceeds wait_seconds (poll /v1/crawls/{run_id})
GET /v1/health/cloud
Returns whether anti-bot routing is configured, current account balance, and this project's last 24h cloud usage.
Tooling
Two artifacts are kept in sync with this reference. Use either to skip the boilerplate.
| Artifact | Use it for | Download |
|---|---|---|
| postman.json | Postman / Insomnia / Bruno collection. Pre-populated requests with sample bodies and {{base_url}} + {{api_key}} variables. | Download |
| openapi.json | OpenAPI 3.1 spec. Auto-generate client libs (Python / Node / Go / Rust) via openapi-generator, register as a ChatGPT custom GPT action, or import into any tool that speaks OpenAPI. | Download |
Last updated 2026-05-10. API version v0.5.0.