API Reference · v1 · v0.5

Introduction

The crawlcrawl API is REST. JSON in, JSON out. Bearer-token auth. Endpoints are versioned (/v1) and we won't break the wire format inside a major version.

Base URL:

https://api.crawlcrawl.com

Public hostname with Let's Encrypt TLS — standard cert verification works, no client tweaks required.

Quickstart

The shortest possible exchange: start a crawl, poll until done, fetch a page.

# export your key (sign up at dashboard.crawlcrawl.com/signup for 1,500 free/month)
export CRAWLCRAWL_KEY="crk_..."

# 1. start a crawl
curl -X POST https://api.crawlcrawl.com/v1/crawls \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "url": "https://example.com", "max_pages": 50 }'
# → 202 {"id":43,"status":"queued","url":"https://example.com"}

# 2. poll the run
curl https://api.crawlcrawl.com/v1/crawls/43 \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"

# 3. when status=done, list pages and fetch one as markdown
curl "https://api.crawlcrawl.com/v1/crawls/43/pages" \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"

curl "https://api.crawlcrawl.com/v1/pages/195?format=markdown" \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"

Authentication

Every request needs Authorization: Bearer crk_.... Keys are scoped per project; we store only the SHA-256 hash and show plaintext exactly once at mint time. Never put a key in browser code.

To mint a key: create a free project at dashboard.crawlcrawl.com/signup. Rotate via POST /v1/keys/rotate. Revocation is instant — keys are server-side only.

TLS

HTTPS only — auto-provisioned via Let's Encrypt on api.crawlcrawl.com. No special client config required; standard cert verification works.

Limits

Limits are per-project and configurable per key. The defaults below match what new keys ship with.

Limit                        Default       Notes
Concurrent runs              5             Submitting a 6th while 5 are queued/running returns 429.
max_pages per run            100,000       Hard cap; the request is rejected at validation.
depth                        1..50         Server-side validation — outside this range returns 400.
concurrency per crawl        unset (auto)  Sustained throughput on one LXC is ~3 pages/s; raising concurrency above ~16 mostly queues at the worker pool.
Sustained system throughput  ~3 pages/s    One LXC, public IP. Current workload budget: 100k–200k pages/day across all projects.

Errors

HTTP status codes are honest. Bodies are JSON in this envelope:

{
  "error": {
    "code":    "invalid_input",
    "message": "depth must be 1..=50"
  }
}

Code               HTTP  Meaning
invalid_input      400   Body failed validation (bad URL, depth out of range, etc.).
unauthorized       401   Missing or wrong bearer token.
not_found          404   Resource doesn't exist, or belongs to a different project.
too_many_requests  429   You hit your concurrent-runs cap. Wait for one to finish.
internal           500   Something on our side. It will be in our logs; ping us.
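
A minimal consumer sketch: capture the status code and branch on the envelope's error.code. The retry cadence is our suggestion, not a documented contract; jq is assumed to be installed.

# start a crawl; back off on 429 (concurrent-runs cap), surface anything else
resp=$(curl -s -w '\n%{http_code}' -X POST https://api.crawlcrawl.com/v1/crawls \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "url": "https://example.com", "max_pages": 50 }')
code=$(printf '%s' "$resp" | tail -n1)   # HTTP status (last line)
body=$(printf '%s' "$resp" | sed '$d')   # JSON body (everything above it)
if [ "$code" = "429" ]; then
  sleep 10    # wait for a running crawl to finish, then retry
elif [ "$code" -ge 400 ]; then
  printf 'error %s: %s\n' "$(printf '%s' "$body" | jq -r '.error.code')" \
                          "$(printf '%s' "$body" | jq -r '.error.message')"
fi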

POST /v1/crawls — start a crawl

POST /v1/crawls

Enqueues a crawl. Returns 202 immediately — the worker picks it up from the Postgres queue and starts within seconds.

Body — top-level fields

Field        Type     Description
url          string   Required. Must start with http:// or https://. The host is checked against the malicious-domain blocklist before queueing.
user_agent   string   Optional. If omitted (or it starts with InternalCrawler/), we rotate to a fresh real-browser UA via ua_generator.
proxy_url    string?  Optional. Per-request override of the project default. Supports http://, https://, socks5://.
webhook_url  string?  Optional. POSTed once when the run reaches done or failed. See Webhooks.

Body — crawl params (FLAT, not nested)

These fields go at the top level of the body — not under a params key. The server uses serde flatten; nested params are silently ignored. See the example after the table.

Field           Type     Default  Description
max_pages       int      1000     Hard cap; the crawl stops when it is reached. Range 1..=100_000.
concurrency     int?     auto     Parallel in-flight requests.
depth           int      5        Max link-following depth from the seed. Range 1..=50.
delay_ms        int      250      Delay between requests, per host.
subdomains      bool     false    If true, follows links into subdomains of the seed host.
respect_robots  bool     true     Obey robots.txt.
store_html      bool     true     If false, we collect URL/status/bytes/metadata only — useful for cheap site maps.
seed_kind       string?  "url"    Either "url" (link-following) or "sitemap" (walks sitemap.xml + sitemap-index recursively). In sitemap mode you may pass either the origin (https://example.com, defaults to /sitemap.xml) or the explicit sitemap URL (https://example.com/sitemap-index.xml) — both work. Much faster than link-following on large sites.
headers         object?  (none)   String→string map attached to every request in the crawl (e.g. auth, custom referer).
cookies         string?  (none)   Set-Cookie-style string: k=v; k=v.
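
To make the flat-shape rule concrete: the request below sets crawl params correctly at the top level. The same fields wrapped in a params object would be silently dropped and the defaults used instead.

curl -X POST https://api.crawlcrawl.com/v1/crawls \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "url": "https://example.com",
        "max_pages": 200,
        "depth": 3,
        "delay_ms": 500,
        "subdomains": true
      }'
# WRONG: { "url": "...", "params": { "max_pages": 200 } }  (max_pages silently stays 1000)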

Headers

Header           Purpose
Authorization    Bearer crk_... — required.
Content-Type     application/json — required.
Idempotency-Key  Optional but recommended. See Idempotency.

Response — 202 Accepted

{
  "id":     43,
  "status": "queued",
  "url":    "https://example.com"
}

GET /v1/crawls/{id} — run status

GET /v1/crawls/{id}

Poll every 2–5 seconds while status is queued or running. Or skip polling entirely and use a webhook.

{
  "id":            43,
  "url":           "https://aeoniti.com",
  "status":        "running",
  "page_count":    27,
  "error_count":   0,
  "enqueued_at":   "2026-05-08T16:48:36.151Z",
  "started_at":    "2026-05-08T16:48:36.401Z",
  "finished_at":   null,
  "error_message": null,
  "proxy_url":     null
}

status is one of queued | running | done | failed | cancelled.

Note: page_count includes pages that returned non-200 statuses (those count as "fetched"). error_count only counts crawler-side failures: DNS, TLS, connect timeout.
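
A throwaway polling loop in the spirit of the quickstart; assumes jq and a 3-second cadence (anywhere in the recommended 2–5 s window works):

run_id=43
while :; do
  status=$(curl -s https://api.crawlcrawl.com/v1/crawls/$run_id \
    -H "Authorization: Bearer $CRAWLCRAWL_KEY" | jq -r '.status')
  case "$status" in
    done|failed|cancelled) echo "run $run_id: $status"; break ;;
    *)                     sleep 3 ;;
  esac
done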

DELETE /v1/crawls/{id} — cancel and cascade

DELETE /v1/crawls/{id}

Cancels if running, then cascades — all pages rows for the run are removed. Returns 204 No Content. Subsequent GETs return 404.

GET /v1/crawls/{id}/pages — list pages

GET /v1/crawls/{id}/pages?limit=100&offset=0&status=200
Query   Default  Notes
limit   100      Max 1000.
offset  0        Standard offset paging.
status  (none)   Filter by HTTP status (e.g. only 200s).

{
  "items": [
    { "id": 195, "url": "https://aeoniti.com/pricing",
      "status": 200, "bytes": 13650,
      "fetched_at": "2026-05-08T16:48:40.595Z" }
  ],
  "next_offset": 14
}

Save the global id per row — that's the path parameter for fetching content.
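
A paging sketch. The docs above don't pin down how next_offset signals the final page, so this assumes it is null or absent once the list is exhausted; adjust if your responses differ.

run_id=43; offset=0
while :; do
  page=$(curl -s "https://api.crawlcrawl.com/v1/crawls/$run_id/pages?limit=1000&offset=$offset" \
    -H "Authorization: Bearer $CRAWLCRAWL_KEY")
  printf '%s' "$page" | jq -r '.items[] | "\(.id)\t\(.status)\t\(.url)"'
  offset=$(printf '%s' "$page" | jq -r '.next_offset // empty')
  [ -z "$offset" ] && break   # assumption: next_offset disappears on the last page
done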

GET /v1/pages/{id} — fetch one page

GET /v1/pages/{id}?format=markdown

id is the global page id from the list endpoint, not a 1-based index within a run.

format    Returns
html      Raw HTML (decompressed) + metadata. Default.
markdown  Clean markdown via fast_html2md, link-preserving, + metadata.
article   llm_readability boilerplate-stripped article_text + article_html + metadata.
both      html + markdown + article_text + article_html + metadata.

Always returns id, url, status, bytes, fetched_at. metadata shape:

{
  "title":       "Pricing — AEONiti",
  "description": "...",
  "canonical":   "https://aeoniti.com/pricing",
  "og":      { "title": "...", "image": "...", "type": "website" },
  "twitter": { "card": "summary_large_image" },
  "json_ld": [ { "@type": "Product" } ]
}

GET /v1/health · /v1/ready

No auth. Use /v1/health for liveness (process up); /v1/ready additionally verifies that Postgres is reachable and the worker heartbeat is current.

curl https://api.crawlcrawl.com/v1/health # → 200 "ok"
curl https://api.crawlcrawl.com/v1/ready  # → 200 "ready" or 503 + JSON

Webhooks

If you set webhook_url on a crawl, the worker POSTs once when the run reaches a terminal state (done or failed).

Headers

Content-Type: application/json
X-Crawler-Run-Id: 43
X-Crawler-Signature: <hex HMAC-SHA256 of the body; see GET /v1/webhook/secret>

Body

{
  "id":            43,
  "project_id":    2,
  "url":           "https://aeoniti.com",
  "status":        "done",
  "page_count":    14,
  "error_count":   0,
  "started_at":    "...",
  "finished_at":   "...",
  "error_message": null,
  "delivered_at":  "..."
}

Retry policy

  • 5 attempts total.
  • Exponential backoff: 1s, 2s, 4s, 8s.
  • 2xx → status delivered.
  • 4xx → marked dead immediately. Your endpoint rejected it; we don't retry.
  • 5xx / network / timeout → retry. After attempt 5 → dead.

Delivery state persists on the run. You can poll GET /v1/crawls/{id} and read webhook_status, webhook_attempts, webhook_last_error.

Deliveries are signed: verify the X-Crawler-Signature header with your project's HMAC-SHA256 secret (GET /v1/webhook/secret) before trusting the payload. If you skip verification, treat the webhook body as advisory and re-fetch the run via GET before acting on it.

Idempotency

Pass Idempotency-Key: <your-uuid> on POST to make retries safe. The same key returns the original run (with the same id) instead of creating a duplicate. Keys are scoped per project and persist for 24 hours.

Use this when the network call to start the crawl might retry (queue handlers, lambda invocations, etc.).
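
For example, from a retrying job (the UUID is illustrative; any client-generated unique string per logical request works):

curl -X POST https://api.crawlcrawl.com/v1/crawls \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -H "Idempotency-Key: 6b6f58a0-6d2e-4a0c-9f1e-3e6d1f0a2b4c" \
  -d '{ "url": "https://example.com", "max_pages": 50 }'
# Re-running this exact command within 24h returns the original run, same id.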

Common patterns

Single-page extract (URL → markdown)

{ "url": "https://example.com/post", "max_pages": 1, "depth": 1 }

Then ?format=both on the one resulting page.
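
Stitched together in shell (a sketch; assumes jq and the polling loop from the run-status section):

run=$(curl -s -X POST https://api.crawlcrawl.com/v1/crawls \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" -H "Content-Type: application/json" \
  -d '{ "url": "https://example.com/post", "max_pages": 1, "depth": 1 }' | jq -r '.id')
# ...poll /v1/crawls/$run until status=done, then grab the single page id:
page=$(curl -s "https://api.crawlcrawl.com/v1/crawls/$run/pages" \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" | jq -r '.items[0].id')
curl -s "https://api.crawlcrawl.com/v1/pages/$page?format=both" \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"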

Crawl whole site, get clean text only

{
  "url": "https://customer.com",
  "max_pages": 500, "depth": 8,
  "store_html": false
}

Then ?format=article per page → embeddings input.

Fast site map via sitemap.xml

// either form works — origin defaults to /sitemap.xml,
// or pass the explicit URL (handy for /sitemap-index.xml etc.)
{
  "url": "https://customer.com/sitemap.xml",
  "seed_kind": "sitemap",
  "max_pages": 5000,
  "store_html": false
}

Authenticated crawl

{
  "url": "https://app.customer.com",
  "headers": { "Authorization": "Bearer abc123" },
  "cookies": "session=XYZ; csrf=ABC",
  "respect_robots": false
}

Async with webhook (recommended for production)

{
  "url": "https://customer.com",
  "max_pages": 200,
  "concurrency": 4,
  "webhook_url": "https://you.example.com/crawl-done"
}

Known limits

Limit                                                                Workaround
No JS rendering in /v1/crawls — SPAs that need a client-side render won't yield content beyond the shell    Use seed_kind: "sitemap" if the site publishes one, or fetch pages through /v1/cloud/render or /v1/cloud/scrape with chrome: true.
Single LXC, single Postgres — no HA                                  Daily pg_dump backups, weekly restore-test drill. R2 off-box backup landing soon.
store_html=false means pages can't be re-fetched as markdown later   Decide upfront, per crawl.
404/403/429 from the target are pages, not errors                    Filter the page list by status if you only want successes.

Roadmap

Things that aren't built yet, in rough priority order:

  • POST /v1/map — domain URL discovery via sitemap + robots
  • SSE streaming for live run logs
  • Off-box backup push (Cloudflare R2)
  • JS rendering inside /v1/crawls via headless Chrome (feature flag)
  • Schema-driven structured extraction (the /extract endpoint)
  • Self-host bundle (Docker compose + helm chart)

POST /v1/scan — synchronous single-URL scan

POST /v1/scan

Sub-second URL → markdown + signals + AI-bot policy + llms.txt. No queue, no polling.

Body

Field              Type     Default          Description
url                string                    Required.
user_agent         string?  rotated UA       Custom UA.
max_age_seconds    int?     (none)           If a stored page row for this URL exists within this many seconds, return it (cache hit).
cloud_mode         string?  project default  none | auto | unblocker | browser.
metadata_only      bool     false            Skip markdown/article extraction; return only metadata + a signals subset. ~5× faster.
only_main_content  bool     false            Run readability first to drop nav/footer/sidebar, then convert to markdown.
include_links      bool     false            Return the full anchor list [{to, anchor, rel, nofollow}].
screenshot_inline  bool     false            Include a base64 PNG (counts as +1 cloud op). Surfaces screenshot_error on failure.
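
A representative call: serve from cache if this URL was scanned within the last hour, and drop nav/footer before the markdown conversion.

curl -X POST https://api.crawlcrawl.com/v1/scan \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "url": "https://example.com/pricing",
        "max_age_seconds": 3600,
        "only_main_content": true
      }'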

POST /v1/scan/bulk — parallel multi-URL scan

POST /v1/scan/bulk
{ "urls": ["https://a.com", "https://b.com", "..."],
  "concurrency": 8, "max_age_seconds": 3600 }

Up to 100 URLs per call. concurrency max 32. Each result is an object with {url, ok, result|error}.
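
To split successes from failures client-side (a sketch: the top-level key holding the per-URL objects isn't specified above, so results here is an assumption):

curl -s -X POST https://api.crawlcrawl.com/v1/scan/bulk \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "urls": ["https://a.com", "https://b.com"], "concurrency": 8 }' \
  | jq '{ ok:     [.results[] | select(.ok)],
          failed: [.results[] | select(.ok | not)] }'   # "results" key assumed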

GET /v1/crawls/{id}/links

Lazy-extracts anchors from stored HTML. Supports ?from_page_id=N to scope to one page, plus ?limit&offset.

GET /v1/crawls/{id}/orphans — pages with no inbound link

GET /v1/crawls/{id}/orphans

Returns crawled pages that no other crawled page links to. Excludes the seed.

GET /v1/crawls/{old}/diff/{new} — crawl diff

GET /v1/crawls/{old_id}/diff/{new_id}

Compares two runs' pages by content_hash. Returns added, removed, changed arrays plus an unchanged count.

{
  "old_run_id": 9123, "new_run_id": 9128,
  "summary": { "added": 2, "removed": 0, "changed": 3, "unchanged": 42 },
  "changed": [
    { "url": "https://x.com/pricing",
      "old_hash": "abc...", "new_hash": "def..." }
  ]
}

POST /v1/cloud/scrape

POST /v1/cloud/scrape
{ "url": "https://protected.example",
  "return_format": "markdown",  // markdown | raw | text | commonmark
  "chrome": true // optional headless render
}

Direct anti-bot scrape. Returns content, cost_usd, elapsed_ms.

POST /v1/cloud/crawl

POST /v1/cloud/crawl

Multi-page anti-bot crawl. Body: {url, limit (max 500), return_format, chrome}.

POST /v1/cloud/search

SERP-style search. Body: {query, limit (max 50), return_format}.

POST /v1/cloud/links

Outbound link extraction from a single URL via anti-bot fetch. Returns URLs found ON the input page (forward direction). Body: {url, limit (max 5000)}. NOT a backlink intelligence tool — for inbound-link data (who links to you across the web) use Ahrefs, Semrush, or Moz; we don't operate a web-scale link graph.

POST /v1/cloud/screenshot

POST /v1/cloud/screenshot

Returns raw PNG bytes. Body: {url}.

GET /v1/cloud/balance

GET /v1/cloud/balance

Account-wide remaining cloud credit balance.

POST /v1/cloud/render

POST /v1/cloud/render

JS-rendered HTML via local headless Chrome with vendor fallback. Body: {url, force_backend?, return_format?}.

POST /v1/cloud/transform

POST /v1/cloud/transform

HTML→markdown (local fast_html2md) or PDF→text (local poppler). Body: {data, input_kind, return_format}. Runs locally; no spider.cloud credit charged.

POST /v1/cloud/unblock

POST /v1/cloud/unblock

Heavyweight anti-bot via spider.cloud unblocker (residential proxies, captcha solving). Body: {url, return_format}. Higher cost-per-call than scrape/render.

POST /v1/cloud/fetch/{domain}

POST /v1/cloud/fetch/{domain}

Calls a per-domain AI-configured scraper and returns a structured payload. The path includes the target domain and an optional path suffix.

GET /v1/cloud/scrapers

GET /v1/cloud/scrapers

Lists AI-configured per-domain scrapers available on the vendor account.

POST /v1/signup — self-serve signup (public)

POST /v1/signup

Public (no Bearer key). Rate-limited: 3 attempts/IP/24h, plus a global cap of 100 successful signups per day. Requires a Cloudflare Turnstile token in production.

Free tier auto-mints and emails the key. Paid tiers record an intent and return a checkout_url.

GET /v1/billing/checkout

GET /v1/billing/checkout

Public. Creates a Stripe Checkout session for an existing paid signup intent. Returns {checkout_url, mock, intent_id, tier}.

POST /v1/billing/webhook

POST /v1/billing/webhook

Stripe-only. Verifies Stripe-Signature. On checkout.session.completed, provisions the project, marks intent fulfilled, emails the key.

GET /v1/billing/status

GET /v1/billing/status

Auth-required. Returns tier, Stripe IDs, subscription status, and billing_mode (live or mock).

cron field — recurring monitors

Add cron to your POST /v1/crawls body to register a recurring monitor instead of a one-shot run. Standard 5-field UTC cron expression.

{
  "url": "https://competitor.com/pricing",
  "max_pages": 1,
  "cron": "0 */6 * * *",                // every 6 hours
  "webhook_url": "https://yourapp.com/changes",
  "webhook_events": ["crawl.diff_detected"],
  "return_only_changed": true
}
# → 201 { monitor_id, schedule, next_fires_at, url, info }

Monitors are capped per tier (3 / 25 / 200). A tick fires at the closest UTC schedule match, and previous_run_id is auto-linked between consecutive ticks.

webhook_events

Array of event types to deliver. Defaults to ["crawl.done"].

Event                When it fires
crawl.done           Run reaches a terminal state (done or failed).
crawl.diff_detected  Recurring monitor: a previous run exists and the diff is non-zero. Body includes diff: {added, removed, changed, unchanged} + previous_run_id.

return_only_changed

For recurring monitors only. When true, skip webhook delivery if the diff vs previous run shows zero changes. webhook_status is set to 'suppressed_no_change'. First run in a chain (no previous) sets 'suppressed_no_baseline'.

GET /v1/crons — list active monitors

GET /v1/crons

Returns each monitor's id, schedule, url, enabled, last_fired_at, next_fires_at, last_run_id.

PATCH /v1/crons/{id} — edit a monitor

PATCH /v1/crons/{id}

Body: {schedule?, enabled?}. Pause/resume or change cadence without recreating.

DELETE /v1/crons/{id} — remove a monitor

DELETE /v1/crons/{id}

GET /v1/usage — current period usage

GET /v1/usage

Returns tier name, configured caps, today and month-to-date counters with percentages, and seconds until UTC midnight reset.

GET /v1/usage/history

GET /v1/usage/history?days=30

Daily buckets for the last N days (max 365).

GET /v1/keys — list API keys

GET /v1/keys

Hash prefix only — plaintext is never recoverable. Includes created_at, last_used_at, revoked_at, expires_at.

POST /v1/keys/rotate

POST /v1/keys/rotate
{ "label": "ci-runner",
  "grace_seconds": 86400      // existing keys remain valid for this window
}
# → 200 { api_key, prefix, label, grace_seconds }
# Plaintext shown ONCE.

DELETE /v1/keys/{prefix}

DELETE /v1/keys/{prefix}

Refuses if it would leave the project with no active key (returns 409).

GET /v1/logs — audit feed

GET /v1/logs?limit=100&status_min=400&status_max=499

Last N requests for this project with method, path, status, duration_ms, client_ip, request_id.

GET /v1/webhook/secret

GET /v1/webhook/secret

Returns the project's HMAC-SHA256 secret in hex. Lazy-generated on first call. Use to verify X-Crawler-Signature on incoming webhooks.
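
A verification sketch with openssl. Two assumptions to confirm against a real delivery: the response field holding the secret (called secret here) and that the header carries the plain hex HMAC of the raw body, keyed with the hex-decoded secret.

# fetch the project secret once and cache it
secret=$(curl -s https://api.crawlcrawl.com/v1/webhook/secret \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" | jq -r '.secret')   # field name assumed

# inside your webhook handler: body.json is the raw request body,
# $received is the X-Crawler-Signature header value
expected=$(openssl dgst -sha256 -mac HMAC -macopt hexkey:"$secret" < body.json | awk '{print $NF}')
[ "$expected" = "$received" ] && echo verified || echo "mismatch: re-fetch the run before acting"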

GET /v1/robots-policy

GET /v1/robots-policy?url=https://example.com

Returns parsed AI-bot policy + raw robots.txt + llms.txt for the host. Cached per-host (TTL 1h). Cheap.

POST /v1/llms-txt-build

POST /v1/llms-txt-build

Crawls a domain and returns a properly formatted llms.txt file, ready to publish. Synchronous: waits for the crawl up to wait_seconds (default 60, max 180).

{ "url": "https://customer.com",
  "max_pages": 30, "site_name": "Customer Inc",
  "summary": "B2B widget company", "wait_seconds": 60 }
# → 200 text/plain  (the llms.txt content)
# → 408 Request Timeout if crawl exceeds wait_seconds (poll /v1/crawls/{run_id})

GET /v1/health/cloud

GET /v1/health/cloud

Returns whether anti-bot routing is configured, current account balance, and this project's last 24h cloud usage.

Tooling

Two artifacts are kept in sync with this reference. Use either to skip the boilerplate.

Artifact      Use it for
postman.json  Postman / Insomnia / Bruno collection. Pre-populated requests with sample bodies and {{base_url}} + {{api_key}} variables.
openapi.json  OpenAPI 3.1 spec. Auto-generate client libs (Python / Node / Go / Rust) via openapi-generator, register as a ChatGPT custom GPT action, or import into any tool that speaks OpenAPI.
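
For example, a Python client from the spec via the openapi-generator npm wrapper (the Homebrew package installs the same tool as openapi-generator):

npm install -g @openapitools/openapi-generator-cli
openapi-generator-cli generate -i openapi.json -g python -o ./crawlcrawl-client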

Last updated 2026-05-10. API version v0.5.0.