Introduction
The crawlcrawl API is REST. JSON in, JSON out. Bearer-token auth. Endpoints are versioned (/v1) and we won't break the wire format inside a major version.
Base URL:
https://api.crawlcrawl.com/v1. Public hostname with Let's Encrypt TLS — standard cert verification works, no client tweaks required.
Quickstart
The shortest possible exchange: start a crawl, poll until done, fetch a page.
```bash
# export your key (sign up at dashboard.crawlcrawl.com/signup for 1,500 free/month)
export CRAWLCRAWL_KEY="crk_..."

# 1. start a crawl
curl -X POST https://api.crawlcrawl.com/v1/crawls \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "url": "https://example.com", "max_pages": 50 }'
# → 202 {"id":43,"status":"queued","url":"https://example.com"}

# 2. poll the run
curl https://api.crawlcrawl.com/v1/crawls/43 \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"

# 3. when status=done, list pages and fetch one as markdown
curl "https://api.crawlcrawl.com/v1/crawls/43/pages" \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"
curl "https://api.crawlcrawl.com/v1/pages/195?format=markdown" \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY"
```
Authentication
Every request needs Authorization: Bearer crk_.... Keys are scoped per project; we store only the SHA-256 hash and show plaintext exactly once at mint time. Never put a key in browser code.
To mint a key: create a free project at dashboard.crawlcrawl.com/signup. Rotate via POST /v1/keys/rotate. Revocation is instant — keys are server-side only.
TLS
HTTPS only — auto-provisioned via Let's Encrypt on api.crawlcrawl.com. No special client config required; standard cert verification works.
Limits
Limits are per-project and configurable per key. The defaults below match what new keys ship with.
| Limit | Default | Notes |
|---|---|---|
| Concurrent runs | 5 | Submitting a 6th while 5 are queued/running returns 429. |
| max_pages per run | 100,000 | Hard cap; the request is rejected at validation. |
| depth | 1..50 | Server-side validation — outside this range returns 400. |
| concurrency per crawl | unset (auto) | Sustained throughput on one LXC is ~3 pages/s; bumping concurrency above ~16 mostly queues at the worker pool. |
| Sustained system throughput | ~3 pages/s | One LXC, public IP. Workload budget today: 100k–200k pages/day across all projects. |
Errors
HTTP status codes are honest. Bodies are JSON in this envelope:
{ "error": { "code": "invalid_input", "message": "depth must be 1..=50" } }
| Code | HTTP | Meaning |
|---|---|---|
| invalid_input | 400 | Body failed validation (bad URL, depth out of range, etc.). |
| unauthorized | 401 | Missing or wrong bearer token. |
| not_found | 404 | Resource doesn't exist, or belongs to a different project. |
| too_many_requests | 429 | You hit your concurrent-runs cap. Wait for one to finish. |
| internal | 500 | Something on our side. Will be in our logs; ping us. |
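In scripts, the envelope makes failures easy to branch on. A minimal sketch, assuming jq is installed and using a deliberately out-of-range depth to trigger the 400:

```bash
# Trigger a validation error on purpose (depth outside 1..=50) and read the envelope.
resp=$(curl -s -w '\n%{http_code}' -X POST https://api.crawlcrawl.com/v1/crawls \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "url": "https://example.com", "depth": 99 }')
status=$(echo "$resp" | tail -n1)   # last line is the HTTP status code
body=$(echo "$resp" | sed '$d')     # everything above it is the JSON body

if [ "$status" -ge 400 ]; then
  echo "failed ($status): $(echo "$body" | jq -r '.error.code + ": " + .error.message')"
fi
```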
POST /v1/crawls — start a crawl
Enqueues a crawl. Returns 202 immediately — the worker picks it up from the Postgres queue and starts within seconds.
Body — top-level fields
| Field | Type | Description |
|---|---|---|
| url | string | Required. Must start with http:// or https://. Host is checked against the malicious-domain blocklist before queueing. |
| user_agent | string | Optional. If omitted (or starts with InternalCrawler/), we rotate to a fresh real-browser UA via ua_generator. |
| proxy_url | string? | Optional. Per-request override of project default. Supports http://, https://, socks5://. |
| webhook_url | string? | Optional. POSTed once when the run reaches done or failed. See Webhooks. |
Body — crawl params (FLAT, not nested)
These fields go at the top level of the body — not under a params key. The server uses serde flatten; nested params are silently ignored.
| Field | Type | Default | Description |
|---|---|---|---|
| max_pages | int | 1000 | Hard cap; crawl stops when reached. Range 1..=100_000. |
| concurrency | int? | auto | Parallel in-flight requests. |
| depth | int | 5 | Max link-following depth from seed. Range 1..=50. |
| delay_ms | int | 250 | Delay between requests per host. |
| subdomains | bool | false | If true, follows links into subdomains of the seed host. |
| respect_robots | bool | true | Obey robots.txt. |
| store_html | bool | true | If false, we collect URL/status/bytes/metadata only — useful for cheap site maps. |
| seed_kind | string? | "url" | Either "url" (link-following) or "sitemap" (walks sitemap.xml + sitemap-index recursively). In sitemap mode you may pass either the origin (https://example.com, defaults to /sitemap.xml) or the explicit sitemap URL (https://example.com/sitemap-index.xml) — both work. Much faster than link-following on large sites. |
| headers | object? | — | String→string map attached to every request in the crawl (e.g. auth, custom referer). |
| cookies | string? | — | Set-Cookie-style string: k=v; k=v. |
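To make the flat-body rule concrete, here is a request sketch combining fields from both tables above; the values are illustrative.

```bash
# Crawl params sit at the top level of the body, next to url (not under a "params" key).
curl -X POST https://api.crawlcrawl.com/v1/crawls \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "url": "https://example.com",
        "max_pages": 500,
        "depth": 3,
        "delay_ms": 500,
        "subdomains": true,
        "store_html": false
      }'
```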
Headers
| Header | Purpose |
|---|---|
| Authorization | Bearer crk_... — required. |
| Content-Type | application/json — required. |
| Idempotency-Key | Optional but recommended. See Idempotency. |
Response — 202 Accepted
{ "id": 43, "status": "queued", "url": "https://example.com" }
GET /v1/crawls/{id} — run status
Poll every 2–5 seconds while status is queued or running. Or skip polling entirely and use a webhook.
{ "id": 43, "url": "https://aeoniti.com", "status": "running", "page_count": 27, "error_count": 0, "enqueued_at": "2026-05-08T16:48:36.151Z", "started_at": "2026-05-08T16:48:36.401Z", "finished_at": null, "error_message": null, "proxy_url": null }
status ∈ queued | running | done | failed | cancelled.
Note: page_count includes pages that returned non-200 statuses (those count as "fetched"). error_count only counts crawler-side failures: DNS, TLS, connect timeout.
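A minimal polling loop along those lines (jq assumed; run id taken from the 202 response):

```bash
# Poll until the run reaches a terminal state (done | failed | cancelled).
run_id=43
while :; do
  status=$(curl -s "https://api.crawlcrawl.com/v1/crawls/$run_id" \
    -H "Authorization: Bearer $CRAWLCRAWL_KEY" | jq -r '.status')
  echo "status: $status"
  case "$status" in
    done|failed|cancelled) break ;;
  esac
  sleep 3
done
```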
DELETE /v1/crawls/{id} — cancel and cascade
Cancels if running, then cascades — all pages rows for the run are removed. Returns 204 No Content. Subsequent GETs return 404.
GET /v1/crawls/{id}/pages — list pages
| Query | Default | Notes |
|---|---|---|
| limit | 100 | Max 1000. |
| offset | 0 | Standard offset paging. |
| status | — | Filter by HTTP status (e.g. only 200s). |
{ "items": [ { "id": 195, "url": "https://aeoniti.com/pricing", "status": 200, "bytes": 13650, "fetched_at": "2026-05-08T16:48:40.595Z" } ], "next_offset": 14 }
Save the global id per row — that's the path parameter for fetching content.
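A paging sketch that collects those ids (jq assumed; it simply advances offset until an empty page comes back rather than relying on next_offset):

```bash
# Walk the page list 1000 rows at a time and print the global page ids.
run_id=43
offset=0
while :; do
  page=$(curl -s "https://api.crawlcrawl.com/v1/crawls/$run_id/pages?limit=1000&offset=$offset" \
    -H "Authorization: Bearer $CRAWLCRAWL_KEY")
  count=$(echo "$page" | jq '.items | length')
  [ "$count" -eq 0 ] && break
  echo "$page" | jq -r '.items[].id'
  offset=$((offset + count))
done
```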
GET /v1/pages/{id} — fetch one page
id is the global page id from the list endpoint, not a 1-based index within a run.
| format | Returns |
|---|---|
| html | Raw HTML (decompressed) + metadata. Default. |
| markdown | Clean markdown via fast_html2md, link-preserving + metadata. |
| article | llm_readability boilerplate-stripped article_text + article_html + metadata. |
| both | html + markdown + article_text + article_html + metadata. |
Always returns id, url, status, bytes, fetched_at. metadata shape:
{ "title": "Pricing — AEONiti", "description": "...", "canonical": "https://aeoniti.com/pricing", "og": { "title": "...", "image": "...", "type": "website" }, "twitter": { "card": "summary_large_image" }, "json_ld": [ { "@type": "Product" } ] }
GET /v1/health · /v1/ready
No auth. Use /v1/health for liveness (process up); /v1/ready verifies Postgres reachable + worker heartbeat current.
```bash
curl https://api.crawlcrawl.com/v1/health   # → 200 "ok"
curl https://api.crawlcrawl.com/v1/ready    # → 200 "ready" or 503 + JSON
```
Webhooks
If you set webhook_url on a crawl, the worker POSTs once when the run reaches a terminal state (done or failed).
Headers
```
Content-Type: application/json
X-Crawler-Run-Id: 43
```
Body
{ "id": 43, "project_id": 2, "url": "https://aeoniti.com", "status": "done", "page_count": 14, "error_count": 0, "started_at": "...", "finished_at": "...", "error_message": null, "delivered_at": "..." }
Retry policy
- 5 attempts total.
- Exponential backoff: 1s, 2s, 4s, 8s.
- 2xx → status delivered.
- 4xx → marked dead immediately. Your endpoint rejected it; we don't retry.
- 5xx / network / timeout → retry. After attempt 5 → dead.
Delivery state persists on the run. You can poll GET /v1/crawls/{id} and read webhook_status, webhook_attempts, webhook_last_error.
No HMAC signing on these deliveries yet — until that ships, treat the webhook body as advisory: re-fetch the run via GET before acting on it. (See GET /v1/webhook/secret for the X-Crawler-Signature scheme.)
Idempotency
Pass Idempotency-Key: <your-uuid> on POST to make retries safe. The same key returns the original run (with the same id) instead of creating a duplicate. Keys are scoped per project and persist for 24 hours.
Use this when the network call to start the crawl might retry (queue handlers, lambda invocations, etc.).
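A minimal sketch: mint the key once, persist it alongside the job, and send the same value on every retry.

```bash
# Mint the key once, persist it, and reuse the same value on every retry of this request.
idem_key=$(uuidgen)
curl -X POST https://api.crawlcrawl.com/v1/crawls \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -H "Idempotency-Key: $idem_key" \
  -d '{ "url": "https://example.com", "max_pages": 50 }'
```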
Common patterns
Single-page extract (URL → markdown)
{ "url": "https://example.com/post", "max_pages": 1, "depth": 1 }
Then ?format=both on the one resulting page.
Crawl whole site, get clean text only
{ "url": "https://customer.com", "max_pages": 500, "depth": 8, "store_html": false }
Then ?format=article per page → embeddings input.
Fast site map via sitemap.xml
```jsonc
// either form works — origin defaults to /sitemap.xml,
// or pass the explicit URL (handy for /sitemap-index.xml etc.)
{
  "url": "https://customer.com/sitemap.xml",
  "seed_kind": "sitemap",
  "max_pages": 5000,
  "store_html": false
}
```
Authenticated crawl
{ "url": "https://app.customer.com", "headers": { "Authorization": "Bearer abc123" }, "cookies": "session=XYZ; csrf=ABC", "respect_robots": false }
Async with webhook (recommended for production)
{ "url": "https://customer.com", "max_pages": 200, "concurrency": 4, "webhook_url": "https://you.example.com/crawl-done" }
Known limits
| Limit | Workaround |
|---|---|
| No JS rendering in crawl runs — SPAs that need client-side render won't yield content beyond the shell | Use seed_kind: "sitemap" if the site publishes one, or render individual pages via POST /v1/cloud/render. Headless-chrome rendering inside crawls is on the roadmap. |
| Webhooks are unsigned | Re-verify by calling GET /v1/crawls/{id} from your handler. |
| Single LXC, single Postgres — no HA | Daily pg_dump backups, weekly restore-test drill. R2 off-box backup landing soon. |
| store_html=false means pages can't be re-fetched as markdown later | Decide upfront per crawl. |
| 404/403/429 from target are pages, not errors | Filter the page list by status if you only want successes. |
Roadmap
Things that aren't built yet, in rough priority order:
- POST /v1/map — domain URL discovery via sitemap + robots
- HMAC-signed webhooks
- SSE streaming for live run logs
- Off-box backup push (Cloudflare R2)
- JS rendering via headless Chrome in the crawl pipeline (feature flag)
- Schema-driven structured extraction (the /extract endpoint)
- Self-host bundle (Docker Compose + Helm chart)
POST /v1/scan — synchronous single-URL scan
Sub-second URL → markdown + signals + AI-bot policy + llms.txt. No queue, no polling.
Body
| Field | Type | Default | Description |
|---|---|---|---|
| url | string | — | Required. |
| user_agent | string? | rotated UA | Custom UA. |
| max_age_seconds | int? | — | If a stored page row for this URL exists within this many seconds, return it (cache hit). |
| cloud_mode | string? | project default | One of: none, auto, unblocker, browser. |
| metadata_only | bool | false | Skip markdown/article extraction; return only metadata + signals subset. ~5× faster. |
| only_main_content | bool | false | Run readability first to drop nav/footer/sidebar, then convert to markdown. |
| include_links | bool | false | Return full anchor list [{to, anchor, rel, nofollow}]. |
| screenshot_inline | bool | false | Include base64 PNG (counts as +1 cloud op). Surfaces screenshot_error on failure. |
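A request sketch built from the fields above (values are illustrative; the full response shape isn't reproduced here):

```bash
# Synchronous scan: main-content markdown plus the outbound link list, served from cache if fresh enough.
curl -X POST https://api.crawlcrawl.com/v1/scan \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "url": "https://example.com/post",
        "only_main_content": true,
        "include_links": true,
        "max_age_seconds": 3600
      }'
```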
POST /v1/scan/bulk — parallel multi-URL scan
{ "urls": ["https://a.com", "https://b.com", "..."], "concurrency": 8, "max_age_seconds": 3600 }
Up to 100 URLs per call. concurrency max 32. Each result is an object with {url, ok, result|error}.
GET /v1/crawls/{id}/links — link graph
Lazy-extracts anchors from stored HTML. Supports ?from_page_id=N to scope to one page, plus ?limit&offset.
GET /v1/crawls/{id}/orphans — pages with no inbound link
Returns crawled pages that no other crawled page links to. Excludes the seed.
GET /v1/crawls/{old}/diff/{new} — crawl diff
Compares two runs' pages by content_hash. Returns added, removed, changed arrays plus an unchanged count.
{ "old_run_id": 9123, "new_run_id": 9128, "summary": { "added": 2, "removed": 0, "changed": 3, "unchanged": 42 }, "changed": [ { "url": "https://x.com/pricing", "old_hash": "abc...", "new_hash": "def..." } ] }
POST /v1/cloud/scrape
{ "url": "https://protected.example", "return_format": "markdown", // markdown | raw | text | commonmark "chrome": true // optional headless render }
Direct anti-bot scrape. Returns content, cost_usd, elapsed_ms.
POST /v1/cloud/crawl
Multi-page anti-bot crawl. Body: {url, limit (max 500), return_format, chrome}.
POST /v1/cloud/search
SERP-style search. Body: {query, limit (max 50), return_format}.
POST /v1/cloud/links
Outbound link extraction from a single URL via anti-bot fetch. Returns URLs found ON the input page (forward direction). Body: {url, limit (max 5000)}. NOT a backlink intelligence tool — for inbound-link data (who links to you across the web) use Ahrefs, Semrush, or Moz; we don't operate a web-scale link graph.
POST /v1/cloud/screenshot
Returns raw PNG bytes. Body: {url}.
GET /v1/cloud/balance
Account-wide remaining cloud credit balance.
POST /v1/cloud/render
JS-rendered HTML via local headless Chrome with vendor fallback. Body: {url, force_backend?, return_format?}.
POST /v1/cloud/transform
HTML→markdown (local fast_html2md) or PDF→text (local poppler). Body: {data, input_kind, return_format}. Runs locally; no spider.cloud credit charged.
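A sketch of a local HTML→markdown conversion. The input_kind value "html" is an assumption inferred from the two conversions named above, not a documented constant:

```bash
# Convert an HTML snippet to markdown locally (no spider.cloud credit charged).
# "html" as input_kind is assumed; the reference only names HTML→markdown and PDF→text.
curl -X POST https://api.crawlcrawl.com/v1/cloud/transform \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "data": "<h1>Hello</h1><p>World</p>",
        "input_kind": "html",
        "return_format": "markdown"
      }'
```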
POST /v1/cloud/unblock
Heavyweight anti-bot via spider.cloud unblocker (residential proxies, captcha solving). Body: {url, return_format}. Higher cost-per-call than scrape/render.
POST /v1/cloud/fetch/{domain}
Call a per-domain AI-configured scraper. Returns structured payload. Path includes target domain and optional path suffix.
GET /v1/cloud/scrapers
Lists AI-configured per-domain scrapers available on the vendor account.
POST /v1/signup — self-serve signup (public)
The only endpoint that does not use a Bearer key. Public, rate-limited (3 attempts/IP/24h + 100 successful/day globally), requires a Cloudflare Turnstile token in production.
Free tier auto-mints and emails the key. Paid tiers record an intent and return a checkout_url.
GET /v1/billing/checkout
Public. Creates a Stripe Checkout session for an existing paid signup intent. Returns {checkout_url, mock, intent_id, tier}.
POST /v1/billing/webhook
Stripe-only. Verifies Stripe-Signature. On checkout.session.completed, provisions the project, marks intent fulfilled, emails the key.
GET /v1/billing/status
Auth-required. Returns tier, Stripe IDs, subscription status, and billing_mode (live or mock).
cron field — recurring monitors
Add cron to your POST /v1/crawls body to register a recurring monitor instead of a one-shot run. Standard 5-field UTC cron expression.
{ "url": "https://competitor.com/pricing", "max_pages": 1, "cron": "0 */6 * * *", // every 6 hours "webhook_url": "https://yourapp.com/changes", "webhook_events": ["crawl.diff_detected"], "return_only_changed": true } # → 201 { monitor_id, schedule, next_fires_at, url, info }
Monitor cap per tier (3 / 25 / 200). Tick fires at the closest UTC schedule match. previous_run_id is auto-linked between consecutive ticks.
webhook_events
Array of event types to deliver. Defaults to ["crawl.done"].
| Event | When it fires |
|---|---|
| crawl.done | Run reaches terminal state (done or failed). |
| crawl.diff_detected | Recurring monitor: previous run exists, diff is non-zero. Body includes diff: {added, removed, changed, unchanged} + previous_run_id. |
return_only_changed
For recurring monitors only. When true, skip webhook delivery if the diff vs previous run shows zero changes. webhook_status is set to 'suppressed_no_change'. First run in a chain (no previous) sets 'suppressed_no_baseline'.
GET /v1/crons — list active monitors
Returns each monitor's id, schedule, url, enabled, last_fired_at, next_fires_at, last_run_id.
PATCH /v1/crons/{id} — edit a monitor
Body: {schedule?, enabled?}. Pause/resume or change cadence without recreating.
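For example, pausing a monitor without deleting it (the monitor id is illustrative):

```bash
# Pause monitor 7; send {"enabled": true} later to resume, or a new "schedule" to change cadence.
curl -X PATCH https://api.crawlcrawl.com/v1/crons/7 \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "enabled": false }'
```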
DELETE /v1/crons/{id} — remove a monitor
GET /v1/usage — current period usage
Returns tier name, configured caps, today and month-to-date counters with percentages, and seconds until UTC midnight reset.
GET /v1/usage/history
Daily buckets for the last N days (max 365).
GET /v1/keys — list API keys
Hash prefix only — plaintext is never recoverable. Includes created_at, last_used_at, revoked_at, expires_at.
POST /v1/keys/rotate
{ "label": "ci-runner", "grace_seconds": 86400 // existing keys remain valid for this window } # → 200 { api_key, prefix, label, grace_seconds } # Plaintext shown ONCE.
DELETE /v1/keys/{prefix}
Refuses if it would leave the project with no active key (returns 409).
GET /v1/logs — audit feed
Last N requests for this project with method, path, status, duration_ms, client_ip, request_id.
GET /v1/webhook/secret
Returns the project's HMAC-SHA256 secret in hex. Lazy-generated on first call. Use to verify X-Crawler-Signature on incoming webhooks.
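If your deliveries carry X-Crawler-Signature, a verification sketch with openssl. Two assumptions here that the text above doesn't pin down: the signature is the hex HMAC-SHA256 of the raw request body, and the secret endpoint returns a field named secret.

```bash
# Fetch the project secret once and cache it ("secret" as the field name is an assumption).
secret=$(curl -s https://api.crawlcrawl.com/v1/webhook/secret \
  -H "Authorization: Bearer $CRAWLCRAWL_KEY" | jq -r '.secret')

# In your webhook handler: $raw_body is the unparsed request body,
# $received_sig is the value of the X-Crawler-Signature header.
expected=$(printf '%s' "$raw_body" | openssl dgst -sha256 -hmac "$secret" | awk '{print $NF}')
if [ "$expected" = "$received_sig" ]; then
  echo "signature ok"
else
  echo "signature mismatch; re-fetch the run via GET /v1/crawls/{id} before acting"
fi
```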
GET /v1/robots-policy
Returns parsed AI-bot policy + raw robots.txt + llms.txt for the host. Cached per-host (TTL 1h). Cheap.
POST /v1/llms-txt-build
Crawl a domain, return a properly-formatted llms.txt file ready to publish. Synchronous (waits for crawl up to wait_seconds; default 60, max 180).
{ "url": "https://customer.com", "max_pages": 30, "site_name": "Customer Inc", "summary": "B2B widget company", "wait_seconds": 60 } # → 200 text/plain (the llms.txt content) # → 408 Request Timeout if crawl exceeds wait_seconds (poll /v1/crawls/{run_id})
GET /v1/health/cloud
Returns whether anti-bot routing is configured, current account balance, and this project's last 24h cloud usage.
Tooling
Two artifacts are kept in sync with this reference. Use either to skip the boilerplate.
| Artifact | Use it for | Download |
|---|---|---|
| postman.json | Postman / Insomnia / Bruno collection. Pre-populated requests with sample bodies and {{base_url}} + {{api_key}} variables. | Download |
| openapi.json | OpenAPI 3.1 spec. Auto-generate client libs (Python / Node / Go / Rust) via openapi-generator, register as a ChatGPT custom GPT action, or import into any tool that speaks OpenAPI. | Download |
Last updated 2026-05-10. API version v0.5.0.