Scrapers

The full scraper CRUD + run + watch lifecycle. A scraper is a reusable extraction recipe; runs of a scraper produce jobs; jobs produce rows of data (see Data).

Create or update — PUT /api/scraper

Upsert semantics: omit id in the body to create a new scraper, set id to update an existing one. The same endpoint covers both — there is no separate PUT /api/scraper/{id} (try it and you get 405). The body is the full ScraperDTO; partial updates are not supported — always fetch the current state via GET /api/scraper/{id}, mutate, then PUT the whole thing back.
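The fetch, mutate, PUT cycle described above can be sketched as follows. This is a minimal sketch, not an official client: the base URL is copied from the curl examples on this page, `merge_scraper_update` and `update_scraper` are hypothetical helper names, and the sketch assumes the GET response body is the ScraperDTO as plain JSON.

```python
import copy
import json
import urllib.request
import uuid

# Base URL as it appears in the curl examples on this page.
BASE_URL = "https://portal.scrapewise.ai/api/scraper-api"


def merge_scraper_update(current: dict, changes: dict) -> dict:
    """Build a full ScraperDTO body: the fetched state with top-level fields overridden."""
    if not current.get("id"):
        # A PUT without id would create a second scraper instead of updating this one.
        raise ValueError("fetched scraper has no id")
    body = copy.deepcopy(current)
    body.update(copy.deepcopy(changes))
    body["id"] = current["id"]  # the id always comes from the fetched state
    return body


def update_scraper(api_key: str, scraper_id: str, changes: dict) -> dict:
    """Fetch the current definition, apply changes, and PUT the whole thing back."""
    auth = {"Authorization": f"Bearer {api_key}"}
    get_req = urllib.request.Request(f"{BASE_URL}/api/scraper/{scraper_id}", headers=auth)
    with urllib.request.urlopen(get_req) as resp:
        current = json.load(resp)

    body = merge_scraper_update(current, changes)
    put_req = urllib.request.Request(
        f"{BASE_URL}/api/scraper",
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={
            **auth,
            "Content-Type": "application/json",
            "Idempotency-Key": str(uuid.uuid4()),
        },
    )
    with urllib.request.urlopen(put_req) as resp:
        return json.load(resp)
```

Keeping the merge in a pure function makes it easy to check that an update never drops fields the caller did not mention, which is the failure mode a partial PUT would cause.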

The easiest way to discover a valid ScraperDTO for a given site is to call the Preview endpoint first — it returns one or more candidate scraperDTO payloads you can copy into your PUT body.

PUT /api/scraper
Authorization: Bearer <key>
Idempotency-Key: <uuid>
Content-Type: application/json

{
  "id": null,
  "name": "competitor-prices",
  "groupRef": "grp_xyz",
  "type": "MULTIPLE_PRODUCTS",
  "sourceConfig": {
    "url": "https://competitor.com/products",
    "pagination": { "type": "URL_PAGE_PARAM", "param": "page", "max": 50 }
  },
  "itemsConfig": [
    { "name": "title", "selector": "h2.product-title", "kind": "TEXT" },
    { "name": "price", "selector": ".price", "kind": "TEXT" },
    { "name": "in_stock", "selector": ".add-to-cart", "kind": "EXISTS" }
  ],
  "schema": { "id": "sch_..." },
  "postProcessRules": []
}

| Field | Required | Description |
| --- | --- | --- |
| id | no | Omit to create; set to update an existing scraper (must be yours; a cross-tenant id forge is rejected) |
| name | yes | Human label, unique per customer |
| groupRef | yes | The group’s id this scraper belongs to. See Groups |
| type | yes | SINGLE_PRODUCT / MULTIPLE_PRODUCTS / APPLICATION_LD_JSON / AI_CONF / AMAZON / GOOGLE_SEARCH |
| sourceConfig | yes | Source descriptor (url + pagination, or curl). Get the right shape from Preview |
| itemsConfig | conditional | Field selectors (CSS / XPath / JSON-path depending on type). Not used by AI_CONF (which uses schema instead) |
| schema | for AI_CONF only | { id } reference to a customer schema. See Schemas |
| postProcessRules | no | List of PostProcessRule transformations applied to each scraped row (CURRENCY_CONVERT, ENUM_MAP, REGEX_CLEAN). Each rule reads one field and writes a derived column. See Preview a post-process rule for the per-kind DTO shape + behaviour. |
| schedule | no | Cron expression for scheduled runs (also set via PUT /api/scraper/group/{id}/start-type/{startType}) |

Response — the persisted ScraperDTO with the assigned id.

Errors — 400 (validation failed / cross-tenant id forge / schema_validation_failed) / 401 / 402 (plan MAX_SCRAPERS limit on create) / 403 (N/A) / 404 (N/A) / 429 / 500.

List — GET /api/scraper/list

curl -H "Authorization: Bearer $KEY" \
  "https://portal.scrapewise.ai/api/scraper-api/api/scraper/list?limit=50"

Paginated (see REST overview — pagination).

Filters:

| Param | Effect |
| --- | --- |
| q | Free-text on name |
| groupId | Only scrapers in this group |
| hasRunWithin | e.g. 7d: only scrapers run in the last N days |

Errors — 400 (N/A) / 401 / 403 (N/A) / 404 (N/A — empty list returned if none) / 429 / 500.

Get — GET /api/scraper/{id}

Single scraper by id. Returns the full ScraperDTO definition (sourceConfig, itemsConfig, postProcessRules, schema reference, etc.).

Errors — 400 (N/A — invalid id format surfaces as 404 envelope) / 401 / 403 (N/A) / 404 (scraper_not_found) / 429 / 500.

Updating an existing scraper: there is no separate PUT /api/scraper/{id} endpoint. Update is via PUT /api/scraper with id set in the body — same endpoint that handles create.

V2: enriched create/update + get — PUT /api/scraper/v2 + GET /api/scraper/v2/{id}

V2 endpoints have identical semantics to their V1 counterparts (PUT for upsert, GET for read by id) but return the richer ScraperSuperDTO shape instead of ScraperDTO: configValid flag, full inlined schema body, full inlined group body, and rolled-up runStats from the most-recent N runs.

Use V2 when you need the full picture in one round-trip instead of composing V1-get + a separate group fetch + a separate schema fetch. V2 is the preferred shape for new tooling — V1 is kept for portal back-compat.

PUT /api/scraper/v2
Authorization: Bearer <key>
Idempotency-Key: <uuid>
Content-Type: application/json

{ ... same ScraperDTO body as V1 ... }

curl -H "Authorization: Bearer $KEY" \
  https://portal.scrapewise.ai/api/scraper-api/api/scraper/v2/scr_abc123

Response (200): ScraperSuperDTO (strict superset of ScraperDTO):

{
  "id": "scr_abc123",
  "name": "competitor-prices",
  "configValid": true,
  "schema": { "id": "sch_...", "content": { /* full schema document */ } },
  "group": { "id": "grp_...", "name": "competitors", "dataTable": "competitor_prices" },
  "runStats": { "lastNRuns": 5, "successes": 5, "failures": 0, "avgRows": 184 },
  "...": "all V1 fields preserved"
}

Errors (both PUT v2 and GET v2) — same as the V1 equivalents: 400 (validation / cross-tenant id forge) / 401 / 402 (MAX_SCRAPERS on create) / 403 (N/A) / 404 (scraper_not_found on GET) / 429 / 500.

Delete a scraper (destructive — two-call protocol) — DELETE /api/scraper/{id}

Destructive operation. Deleting a scraper is permanent. By default the rows the scraper produced are preserved (still accessible via GET /api/scraper/data?scraperId=...); pass withData=true to also delete those rows.

ADR-012 two-call pattern: first preview, then commit within 5 minutes.

Steps

  1. POST /api/scraper/{id}/preview-delete[?withData=true] — mints a 5-minute DestructiveOpToken + preview summary listing what will be removed (scraper config + groups + optional data cascade).
  2. DELETE /api/scraper/{id}[?withData=true] — commits the deletion (idempotency-key required).

Skipping the preview step deletes rows without confirmation.

Step 1 — Preview

POST /api/scraper/scr_abc123/preview-delete?withData=true
Authorization: Bearer <key>

Response: DestructivePreviewResponseDTO (token, opName, targetEntityId, previewSummary with cascade counts + warnings). Render to the user before committing.

Step 2 — Commit

DELETE /api/scraper/scr_abc123?withData=true
Authorization: Bearer <key>
Idempotency-Key: <uuid>

withData=false (default) → preserves the scraped rows; withData=true → also removes them from the group’s data collection (irreversible).

Response: 204 No Content.
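The preview-then-commit sequence above can be sketched as a small helper. This is a sketch under stated assumptions: `send` is a hypothetical transport hook (method, path, headers) returning a (status, parsed JSON or None) pair, `confirm` is a hypothetical callback that shows the preview summary to a person, and any 2xx on the preview call is treated as success. The page does not show how the DestructiveOpToken is conveyed on the DELETE call, so the sketch issues the two calls exactly as the examples show them and does not attach the token.

```python
import uuid
from typing import Callable, Optional, Tuple

# Hypothetical transport hook: send(method, path, headers) -> (status, json_or_None)
Send = Callable[[str, str, dict], Tuple[int, Optional[dict]]]


def delete_scraper(
    send: Send,
    api_key: str,
    scraper_id: str,
    with_data: bool,
    confirm: Callable[[dict], bool],
) -> bool:
    """Preview the deletion, ask for confirmation, then commit. True if deleted."""
    query = "?withData=true" if with_data else ""
    auth = {"Authorization": f"Bearer {api_key}"}

    # Step 1: mint the 5-minute token and fetch the preview summary.
    status, preview = send("POST", f"/api/scraper/{scraper_id}/preview-delete{query}", auth)
    if not 200 <= status < 300 or preview is None:
        raise RuntimeError(f"preview-delete failed with HTTP {status}")

    # Render cascade counts and warnings before anything is removed.
    if not confirm(preview.get("previewSummary", {})):
        return False

    # Step 2: commit within the token lifetime; the Idempotency-Key is required here.
    headers = {**auth, "Idempotency-Key": str(uuid.uuid4())}
    status, _ = send("DELETE", f"/api/scraper/{scraper_id}{query}", headers)
    if status != 204:
        raise RuntimeError(f"delete failed with HTTP {status}")
    return True
```

Passing the same `with_data` value to both calls keeps the preview and the commit describing the same cascade.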

Errors (both steps)

| Code | Meaning |
| --- | --- |
| 400 | Scraper doesn’t exist for this customer |
| 401 | Missing/invalid bearer |
| 403 | N/A |
| 404 | N/A (400 used) |
| 429 | Rate-limited |
| 500 | Persistence failure mid-delete |

Deleting only the data, not the scraper: use DELETE /api/scraper/data?scraperJobStatusId=..., which also follows the two-call protocol.

Auto-refresh values — keeping a cURL alive across the site’s deploys

Some sites rotate pieces of their URLs or headers on every deploy — Next.js sites change buildId in the _next/data/<id>/… path, Vercel rotates x-deployment-id, anti-bot middleware mints fresh tokens. A cURL captured from DevTools on Tuesday breaks on Wednesday’s deploy, and you’d otherwise have to re-paste a fresh cURL every time.

Two ways the platform solves this — both opt-in per scraper:

1. Zero-config Next.js auto-recovery (no setup needed)

When an API/CURL scraper returns 404 AND the URL matches /_next/data/<id>/, the engine automatically:

  1. Fetches the host root (e.g. https://www.ermitazas.lt/)
  2. Reads <script id="__NEXT_DATA__"> for the fresh buildId
  3. Reads <meta name="next-deployment-id"> for the fresh deployment id (if present)
  4. Rewrites the cURL with the fresh values and retries once

Works for ermitazas.lt (Vercel), karkkainen.com (Next.js on CloudFront), and any standard Next.js site without any per-scraper config. If discovery fails or returns the same tokens (genuine 404), the original error propagates unchanged.
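The discovery and rewrite steps above can be illustrated locally. This is a rough sketch of the idea, not the engine's code: `discover_build_id` and `rewrite_build_id` are hypothetical names, the HTML is matched with a simple regex, and only the buildId part (steps 2 and 4) is covered.

```python
import json
import re
from typing import Optional

# Matches the body of <script id="__NEXT_DATA__" ...>...</script>
NEXT_DATA_RE = re.compile(
    r'<script[^>]*\bid="__NEXT_DATA__"[^>]*>(.*?)</script>', re.DOTALL
)
# Matches the /_next/data/<id>/ segment of a URL
BUILD_SEGMENT_RE = re.compile(r"(/_next/data/)([^/]+)(/)")


def discover_build_id(root_html: str) -> Optional[str]:
    """Read buildId from the __NEXT_DATA__ blob of the host root page."""
    match = NEXT_DATA_RE.search(root_html)
    if not match:
        return None
    try:
        data = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    return data.get("buildId") if isinstance(data, dict) else None


def rewrite_build_id(url: str, fresh_build_id: str) -> Optional[str]:
    """Swap the <id> in /_next/data/<id>/. Returns None when there is nothing to change."""
    match = BUILD_SEGMENT_RE.search(url)
    if not match or match.group(2) == fresh_build_id:
        # Same id as before: treat the 404 as genuine and do not retry.
        return None
    return BUILD_SEGMENT_RE.sub(
        lambda m: m.group(1) + fresh_build_id + m.group(3), url, count=1
    )
```

Returning None when the fresh id equals the stale one mirrors the rule that a genuine 404 propagates unchanged.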

2. Explicit sourceConfig.prelaunch (for custom rotating tokens)

For sites with non-standard rotating tokens (anti-bot middleware, JWT-shaped nonces, custom markers), declare extraction rules on sourceConfig.prelaunch. The cURL template uses {{NAME}} placeholders that get substituted with fresh values before each fetch.

{
  "sourceConfig": {
    "curl": "curl 'https://www.ermitazas.lt/_next/data/{{BUILD_ID}}/lt/search.json?q=home4you' -H 'x-deployment-id: {{DEPLOYMENT_ID}}'",
    "prelaunch": {
      "discoveryUrl": "https://www.ermitazas.lt/",
      "ttlMinutes": 360,
      "tokens": [
        { "name": "BUILD_ID", "source": "NEXT_DATA", "expression": "/buildId" },
        { "name": "DEPLOYMENT_ID", "source": "META", "expression": "next-deployment-id" }
      ]
    }
  }
}

| sourceConfig.prelaunch field | Type | Description |
| --- | --- | --- |
| discoveryUrl | string | Public page to fetch once per TTL to extract fresh values. Usually the site’s home page. |
| ttlMinutes | int (1–43200) | How long a successful resolve is cached. Default 360 (6 hours). On 404 from the proxy, the cache is invalidated and refetched. |
| tokens | TokenRule[] | One rule per placeholder. Empty list = feature disabled. |

| TokenRule field | Type | Description |
| --- | --- | --- |
| name | string, ^[A-Za-z0-9_]+$ | Placeholder name. Referenced as {{NAME}} in the cURL; case-sensitive. |
| source | enum | NEXT_DATA (RFC 6901 JSON pointer into the page’s <script id="__NEXT_DATA__"> blob), META (value of a <meta> tag by its name attribute), or REGEX (first capture group of a regex run against the raw HTML). |
| expression | string | Per-source extractor: /buildId for NEXT_DATA, next-deployment-id for META, "buildId":"([^"]+)" for REGEX. |
| validatePattern | string (optional) | Regex the resolved value must match. Default ^[A-Za-z0-9._=+/~-]{1,4096}$, which accepts dpl_…, base64-padded, and JWT-shaped values and rejects URL-reserved characters and injection vectors (&, ?, #, space, CRLF, quotes). |
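To make the three source kinds concrete, here is a rough local approximation of how a TokenRule could be evaluated against a discovery page. It is a sketch only: `extract_token` and `_PageScan` are hypothetical names, the parsing uses Python's standard `html.parser`, and the platform's real extractor may differ in edge cases.

```python
import json
import re
from html.parser import HTMLParser
from typing import Optional


class _PageScan(HTMLParser):
    """Collects <meta name=...> contents and the __NEXT_DATA__ script body."""

    def __init__(self) -> None:
        super().__init__()
        self.meta = {}
        self.next_data = None
        self._in_next_data = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and attr.get("name"):
            self.meta[attr["name"]] = attr.get("content")
        elif tag == "script" and attr.get("id") == "__NEXT_DATA__":
            self._in_next_data = True
            self.next_data = ""

    def handle_data(self, data):
        if self._in_next_data:
            self.next_data += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_next_data = False


def _json_pointer(doc, pointer: str):
    """Resolve an RFC 6901 JSON pointer such as /buildId."""
    if pointer == "":
        return doc
    if not pointer.startswith("/"):
        raise ValueError("JSON pointer must start with '/'")
    for raw in pointer[1:].split("/"):
        key = raw.replace("~1", "/").replace("~0", "~")
        doc = doc[int(key)] if isinstance(doc, list) else doc[key]
    return doc


def extract_token(html: str, source: str, expression: str) -> Optional[str]:
    """Return the resolved value for one rule, or None when the marker is missing."""
    if source not in ("NEXT_DATA", "META", "REGEX"):
        raise ValueError(f"unknown source: {source}")
    scan = _PageScan()
    scan.feed(html)
    try:
        if source == "NEXT_DATA":
            value = _json_pointer(json.loads(scan.next_data or ""), expression)
        elif source == "META":
            value = scan.meta.get(expression)
        else:
            match = re.search(expression, html)
            value = match.group(1) if match else None
    except (KeyError, IndexError, TypeError, ValueError):
        return None
    return None if value is None else str(value)
```

A missing marker yields None here; in the platform that case is reported as a discovery failure rather than silently ignored.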

Portal users: the same configuration is editable in the create/edit scraper form under “Auto-refresh values” when the scraper is in API/CURL mode. Leave the rules list empty to disable.

Behaviour notes:

  • Self-healing on rotation: if the substituted cURL returns 404/410/451, the engine invalidates the cached tokens, rediscovers, and retries once. If the fresh tokens are byte-identical to the stale ones (i.e. a genuine product 404), the retry is short-circuited.
  • Cache + concurrency: resolved tokens are cached per (scraperId, configHash) per pod. A 500-URL batch hitting the same scraper triggers exactly one discovery fetch — concurrent callers coalesce via a per-scraper mutex.
  • Cross-scraper throttle: discovery requests share a 3-permit semaphore so a cold-start storm (e.g. 50 scrapers waking up after a pod restart) doesn’t saturate the upstream proxy.
  • Failure modes: if discovery fails (network, missing markers, regex no-match, value fails validatePattern), the scrape fails with a typed prelaunch_refresh_failed exception — the system never silently sends a literal {{BUILD_ID}} to the upstream.
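The substitution and failure rules in the notes above can be sketched as a pure function. This is an illustration under assumptions: `render_curl` and the `PrelaunchRefreshFailed` class are hypothetical local stand-ins, and the default pattern is the one quoted in the TokenRule table.

```python
import re
from typing import Dict, Optional

PLACEHOLDER_RE = re.compile(r"\{\{([A-Za-z0-9_]+)\}\}")
# Default validatePattern as quoted in the TokenRule table.
DEFAULT_VALIDATE = r"^[A-Za-z0-9._=+/~-]{1,4096}$"


class PrelaunchRefreshFailed(Exception):
    """Local stand-in for the prelaunch_refresh_failed error named above."""


def render_curl(
    template: str,
    values: Dict[str, str],
    validate_patterns: Optional[Dict[str, str]] = None,
) -> str:
    """Replace every {{NAME}} with its resolved value, or fail loudly."""
    patterns = validate_patterns or {}

    def substitute(match: "re.Match") -> str:
        name = match.group(1)  # case-sensitive lookup
        value = values.get(name)
        if value is None:
            raise PrelaunchRefreshFailed("no value resolved for placeholder " + name)
        if not re.fullmatch(patterns.get(name, DEFAULT_VALIDATE), value):
            raise PrelaunchRefreshFailed("value for " + name + " fails validatePattern")
        return value

    # A function replacement inserts the value literally, with no escape processing.
    return PLACEHOLDER_RE.sub(substitute, template)
```

Raising on the first unresolved or invalid placeholder guarantees that a literal {{BUILD_ID}} can never reach the rendered command.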

Trigger a run — GET /api/scraper/{id}/run

Yes, GET for run is intentional — runs are idempotent at the Idempotency-Key level (see below). Triggers an async job.

curl -H "Authorization: Bearer $KEY" \
  "https://portal.scrapewise.ai/api/scraper-api/api/scraper/$SCRAPER_ID/run"

Response:

{
  "jobId": "job_xyz789",
  "scraperId": "scr_abc123",
  "status": "PENDING",
  "queuedAt": "2026-05-19T12:30:00Z"
}

The job runs async. To watch progress, use the SSE stream (next section).

Idempotency for runs

Pass an Idempotency-Key header to make retries safe:

curl -H "Authorization: Bearer $KEY" \
  -H "Idempotency-Key: 550e8400-e29b-41d4-a716-446655440000" \
  ".../api/scraper/$SCRAPER_ID/run"

If you retry with the same idempotency key within 24h, Scrapewise returns the original job (no double-run).
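A retry loop that relies on this guarantee might look like the sketch below. The assumptions are loud ones: `send` is a hypothetical transport hook returning (status, parsed JSON), any 2xx is treated as success, and retrying on network errors, 429 and 5xx is this sketch's policy rather than documented behaviour. The one property taken from the text is that the key is generated once and reused on every attempt.

```python
import time
import uuid
from typing import Callable, Optional, Tuple

# Hypothetical transport hook: send(method, path, headers) -> (status, json_or_None)
Send = Callable[[str, str, dict], Tuple[int, Optional[dict]]]


def trigger_run(
    send: Send,
    api_key: str,
    scraper_id: str,
    attempts: int = 3,
    backoff_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Trigger a run, retrying transient failures with one stable Idempotency-Key."""
    key = str(uuid.uuid4())  # generated once, reused for every retry
    headers = {"Authorization": f"Bearer {api_key}", "Idempotency-Key": key}
    last_error: Optional[Exception] = None

    for attempt in range(attempts):
        try:
            status, body = send("GET", f"/api/scraper/{scraper_id}/run", headers)
        except OSError as exc:
            # Network-level failure: the request may or may not have been queued.
            last_error = exc
        else:
            if 200 <= status < 300:
                return body
            if status != 429 and status < 500:
                # 400 / 401 / 404 will not improve on retry.
                raise RuntimeError(f"run rejected with HTTP {status}")
            last_error = RuntimeError(f"HTTP {status}")
        if attempt < attempts - 1:
            sleep(backoff_seconds * (2 ** attempt))

    raise RuntimeError("run not confirmed after retries") from last_error
```

Because the key stays the same, a retry after a lost response returns the job that was already queued instead of starting a second one.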

Errors — 400 (scraper_already_running / start_urls_unreachable) / 401 / 403 (N/A) / 404 (scraper_not_found) / 429 / 500.

Stream job status — GET /api/scraper-job-status/{id}/stream

Server-Sent Events stream of an individual job’s progress. The {id} is a scraper job status id (ScraperJobStatusDTO.id), not the scraper id — obtain it from GET /api/scraper/load-history filtered by scraperId.

# 1) Find the latest job for this scraper
JOB_ID=$(curl -s -H "Authorization: Bearer $KEY" \
  "https://portal.scrapewise.ai/api/scraper-api/api/scraper/load-history?scraperId=$SCRAPER_ID&size=1" \
  | jq -r '.content[0].id')

# 2) Stream its progress
curl -N -H "Authorization: Bearer $KEY" \
  "https://portal.scrapewise.ai/api/scraper-api/api/scraper-job-status/$JOB_ID/stream"

The -N flag disables curl’s buffering so events appear as they arrive.

You’ll see events like:

event: status
data: {"state":"RUNNING","itemsQuantity":0}

event: status
data: {"state":"RUNNING","itemsQuantity":42}

event: status
data: {"state":"RUNNING","itemsQuantity":140}

event: status
data: {"state":"SUCCESS","itemsQuantity":175,"duration":47213}

Field names follow ScraperJobStatusDTO: state, itemsQuantity, totalRequests, duration, errorMessage. The stream closes when the job reaches a terminal state (SUCCESS / FAILED / CANCELLED).

Node.js SSE example

import { fetchEventSource } from '@microsoft/fetch-event-source';

// jobStatusId obtained from /api/scraper/load-history beforehand
await fetchEventSource(
  `https://portal.scrapewise.ai/api/scraper-api/api/scraper-job-status/${jobStatusId}/stream`,
  {
    headers: { Authorization: `Bearer ${process.env.SCRAPEWISE_KEY}` },
    onmessage(ev) {
      const data = JSON.parse(ev.data);
      console.log(`${data.state} items=${data.itemsQuantity ?? '-'}`);
    },
  }
);

Python SSE example

import json
import os

import requests

# job_status_id obtained from /api/scraper/load-history beforehand
with requests.get(
    f'https://portal.scrapewise.ai/api/scraper-api/api/scraper-job-status/{job_status_id}/stream',
    headers={'Authorization': f"Bearer {os.environ['SCRAPEWISE_KEY']}"},
    stream=True,
) as r:
    for line in r.iter_lines():
        if line and line.startswith(b'data: '):
            event = json.loads(line[6:])
            print(event['state'], event.get('itemsQuantity'))

Errors — 400 (N/A) / 401 / 403 (N/A) / 404 (N/A — invalid {id} surfaces as a 401 envelope) / 429 (N/A) / 500. The SSE connection closes with the terminal event when the job ends; long-lived 5xx errors break the connection (reconnect from your client).

Stop a running scraper — GET /api/scraper/{id}/stop

Stops the scraper if it’s currently in RUNNING state. Scrapers in any other state are silently skipped (the inner if-check short-circuits the save). Partial scraped data already written to the group’s data collection is kept — clear it with DELETE /api/scraper/data?scraperJobStatusId=... if needed.

GET (not POST/DELETE) is intentional — the operation is fully idempotent at the underlying state machine level and the Idempotency-Key header is required.

curl -H "Authorization: Bearer $KEY" \
  -H "Idempotency-Key: $(uuidgen)" \
  "https://portal.scrapewise.ai/api/scraper-api/api/scraper/$SCRAPER_ID/stop"

Response: 200 OK (empty body). State transitions complete synchronously before the response is returned.

Errors — 400 (scraper_not_found) / 401 / 403 (N/A) / 404 (N/A — 400 envelope used) / 429 / 500.

To stop ALL running scrapers in a group at once, see Stop every running scraper in a group.

Sample data — GET /api/scraper/{id}/get-sample-data

Runs the scraper synchronously against its source URLs (capped to a FIRST batch / FIRST page — NOT a full run, no pagination expansion) and returns the parsed rows immediately. Use to validate a scraper config visually before scheduling a full async run.

curl -H "Authorization: Bearer $KEY" \
  "https://portal.scrapewise.ai/api/scraper-api/api/scraper/$SCRAPER_ID/get-sample-data?useCache=true"

| Param | Default | Description |
| --- | --- | --- |
| useCache | false | Return the cached sample within the 1h TTL (faster, possibly stale). |

Response (200): List<Map<String, Any>> keyed by the scraper’s configured field names. Values are post-processed if rules are set.

Errors — 400 (scraper_not_found) / 401 / 403 (N/A) / 404 (N/A — 400 used instead) / 429 / 500. Also 502 if the source site is unreachable.

Preview a post-process rule — POST /api/scraper/preview-rule

Apply a single PostProcessRule to a user-provided sample value and return the derived output without persisting anything or running a real scrape. Use to validate rule parameters before saving the rule onto a scraper config.

Request body shape

| Field | Type | Description |
| --- | --- | --- |
| rule.kind | enum | CURRENCY_CONVERT / ENUM_MAP / REGEX_CLEAN |
| rule.sourceField | string | Name of the scraped field to read from |
| rule.outputField | string | Name of the derived column to write to (must differ from sourceField and not collide with reserved system fields like date, group, scraperId) |
| rule.outputType | enum | NUMBER / TEXT / BOOLEAN. Declarative type hint for downstream consumers. The executor does NOT coerce values to this type; whatever the rule’s params produce is what’s written. |
| rule.params | object | Per-kind shape; see below |
| sampleValue | any | The value to test the rule against |

Response (200): { "output": <derived value> } (plus rate + rateDate for CURRENCY_CONVERT).

Per-kind reference

Each rule kind has its own params shape and its own null-source behaviour. The three kinds differ deliberately — see the per-kind note below.

CURRENCY_CONVERT — convert one currency to another using ECB rates

| Param | Required | Description |
| --- | --- | --- |
| from | yes | 3-letter ISO code (SEK, EUR, USD, …) |
| to | yes | 3-letter ISO code |

Null source: no-op — the output field is not written.

Side outputs: writes {outputField}_rate and {outputField}_rateDate alongside the converted value.

POST /api/scraper/preview-rule
Authorization: Bearer <key>
Content-Type: application/json

{
  "rule": {
    "kind": "CURRENCY_CONVERT",
    "sourceField": "priceSek",
    "outputField": "priceEur",
    "outputType": "NUMBER",
    "params": { "from": "SEK", "to": "EUR" }
  },
  "sampleValue": 2490
}

Response: { "output": 215.85, "rate": 0.0867, "rateDate": "2026-04-17" }.

ENUM_MAP — look up the source value in a mapping table

| Param | Required | Description |
| --- | --- | --- |
| mapping | yes (non-empty) | Map<string, any>: key is the source value (stringified), value is what gets written to the output column |
| default | no | What to write when no key matches (or when the matched value is null) |

Null source — the null-sentinel: a null source value looks up the empty-string "" key. The same applies to an empty-string source ("") — null and "" are intentionally treated as the same “no meaningful value” for ENUM_MAP lookup purposes. This is the canonical pattern for deriving a True/False column based on whether a scraped field is present or absent on the page.

Null mapped value: if a key maps to null (e.g. {"": null}), that’s treated as “no real mapping” and falls through to default. Closes the silent-strip footgun where a null output value would be removed from the row by the downstream null-filter.

POST /api/scraper/preview-rule
Authorization: Bearer <key>
Content-Type: application/json

{
  "rule": {
    "kind": "ENUM_MAP",
    "sourceField": "promotionalBanner",
    "outputField": "isPromoted",
    "outputType": "TEXT",
    "params": {
      "mapping": { "": "False", "Sale": "True", "New": "True" },
      "default": "Unknown"
    }
  },
  "sampleValue": null
}

Response: { "output": "False" } — the empty-string mapping key matched the null source.

Worked behaviour table for the rule above:

| Scraped promotionalBanner | Matches | Written to isPromoted |
| --- | --- | --- |
| null / field absent from page | "" row | "False" |
| "" (empty string scraped) | "" row | "False" |
| "Sale" | Sale row | "True" |
| "New" | New row | "True" |
| "Mystery" | (no row matches) | "Unknown" (default) |

The same null-sentinel works in the portal’s rule builder — leave the “When scraped value is” cell blank and type the desired output in “Save as”. The empty-key row shows a (matches missing / null) placeholder so it’s visually distinguishable from a half-finished add.
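The ENUM_MAP lookup rules described above are small enough to mimic locally when planning a mapping. The function below is a sketch of those rules as stated on this page, not the server's implementation: `enum_map` is a hypothetical name, and stringifying with Python's `str()` is an assumption that may differ from the server for non-string values such as booleans.

```python
from typing import Any, Dict


def enum_map(source: Any, mapping: Dict[str, Any], default: Any = None) -> Any:
    """Local approximation of ENUM_MAP: null and "" both look up the "" key."""
    key = "" if source is None else str(source)
    mapped = mapping.get(key)
    # A key that maps to null counts as "no real mapping" and falls through to default.
    return default if mapped is None else mapped


mapping = {"": "False", "Sale": "True", "New": "True"}
for scraped in (None, "", "Sale", "New", "Mystery"):
    print(repr(scraped), "->", enum_map(scraped, mapping, default="Unknown"))
```

Running the loop walks through the same five cases as the behaviour table, which makes it a quick check before saving a mapping onto a scraper.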

REGEX_CLEAN — keep only characters matching a safe character class

| Param | Required | Description |
| --- | --- | --- |
| keep | no | Character class to keep, e.g. [0-9], [A-Za-z0-9], or a bare literal like 0-9 (auto-wrapped in […]). Default [A-Za-z0-9]. |

Null source: no-op — the output field is not written.

Safety: the keep pattern is restricted to simple character classes — groups (), quantifiers *+?, alternation |, and curly braces {} are all rejected to prevent catastrophic backtracking. Unsafe pattern → 200 { "error": "Unsafe REGEX_CLEAN 'keep' pattern" }.

POST /api/scraper/preview-rule
Authorization: Bearer <key>
Content-Type: application/json

{
  "rule": {
    "kind": "REGEX_CLEAN",
    "sourceField": "price",
    "outputField": "priceClean",
    "outputType": "TEXT",
    "params": { "keep": "[0-9.]" }
  },
  "sampleValue": "$1,234.56"
}

Response: { "output": "1234.56" }.
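For intuition, the REGEX_CLEAN behaviour described above can be approximated in a few lines. This is a local sketch under assumptions, not the platform's executor: `regex_clean` is a hypothetical name, the unsafe-character list is taken from the safety note, and the server may apply further checks.

```python
import re
from typing import Any, Optional

# Characters the safety note says are rejected: groups, quantifiers, alternation, braces.
UNSAFE_CHARS = set("()*+?|{}")


def regex_clean(source: Any, keep: str = "[A-Za-z0-9]") -> Optional[str]:
    """Keep only the characters matching a simple character class."""
    if source is None:
        return None  # null source: no-op, nothing is written
    if any(ch in UNSAFE_CHARS for ch in keep):
        raise ValueError("Unsafe REGEX_CLEAN 'keep' pattern")
    if not (keep.startswith("[") and keep.endswith("]")):
        keep = f"[{keep}]"  # bare literal such as 0-9 is wrapped into a class
    return "".join(re.findall(keep, str(source)))


print(regex_clean("$1,234.56", "[0-9.]"))  # 1234.56
```

Note that the result stays a string; as the outputType description says, nothing coerces it to a number.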

Cross-rule null-source posture

| Rule kind | Null source behaviour |
| --- | --- |
| CURRENCY_CONVERT | No-op; output field not written |
| ENUM_MAP | Looks up the "" key (null-sentinel); produces default if "" is not in the mapping |
| REGEX_CLEAN | No-op; output field not written |

ENUM_MAP is the only LOOKUP rule of the three (the others are transforms), which is why it’s the only one with null-sentinel semantics. If you need True/False from null/non-null, ENUM_MAP is the rule.

Errors

On invalid rule params, the endpoint returns 200 with an error string instead of throwing — intentional non-throw shape so agents can iterate on rule params without exception handling.

400 (request body validation) / 401 / 403 (N/A) / 404 (N/A) / 429 / 500. Rule-execution failures surface as 200 { "error": "..." } in the body, NOT as an HTTP error.
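Because rule failures arrive as a 200 with an error string, a client has to inspect the body as well as the status code. A minimal sketch of that check follows; `unwrap_preview` and `RulePreviewError` are hypothetical names, and treating a missing "output" key as None is an assumption about the no-op case.

```python
from typing import Any, Dict


class RulePreviewError(Exception):
    """Raised when preview-rule reports a rule failure inside a 200 body."""


def unwrap_preview(status: int, body: Dict[str, Any]) -> Any:
    """Return the derived output, or raise for HTTP errors and in-body rule errors."""
    if status != 200:
        raise RuntimeError(f"preview-rule failed with HTTP {status}")
    if "error" in body:
        # Rule-execution failure: HTTP says OK, the body says otherwise.
        raise RulePreviewError(str(body["error"]))
    return body.get("output")
```

Callers that iterate on rule params can catch RulePreviewError, adjust the params, and call the endpoint again without treating the failure as a transport problem.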

SEO fields

Get SEO fields for a scraper — GET /api/scraper/{id}/seo-fields

Returns SEO metadata (title, description, og:* tags) extracted from the scraper’s configured source URL. Cached at the SEO field service layer.

curl -H "Authorization: Bearer $KEY" \
  https://portal.scrapewise.ai/api/scraper-api/api/scraper/$SCRAPER_ID/seo-fields

Response (200): List<SeoField> (name + content per discovered meta tag). Empty list if no extractable meta tags.

Errors — 400 (scraper_not_found) / 401 / 403 (N/A) / 404 (N/A — 400 used) / 429 / 500.

Get SEO fields for an arbitrary URL — POST /api/scraper/seo-fields

Same extraction logic but takes a URL in the body — no pre-existing scraper required. Subject to the SSRF deny-list.

POST /api/scraper/seo-fields
Authorization: Bearer <key>
Content-Type: application/json

{ "url": "https://example.com" }

Response (200): List<SeoField>.

Errors — 400 (SSRF_BLOCKED for internal / link-local / cloud-metadata hosts) / 401 / 403 (N/A) / 404 (N/A) / 429 / 500. Also 502 if upstream unreachable.

Group-level run + stop

Run every scraper in a group — GET /api/scraper/group/{id}/run

Triggers a manual run of EVERY scraper that belongs to the group. Each run is queued asynchronously. Requires the RUN_WITH_GROUP plan feature.

curl -H "Authorization: Bearer $KEY" \
  -H "Idempotency-Key: $(uuidgen)" \
  "https://portal.scrapewise.ai/api/scraper-api/api/scraper/group/grp_xyz/run?mergeData=true"

| Param | Default | Description |
| --- | --- | --- |
| mergeData | false | After all per-scraper runs complete, merge results into the group’s enriched-sibling collection. |

Response: 204 No Content. Poll GET /api/scraper/load-history per scraper to check status.

Errors — 400 (group not found) / 401 / 402 (plan lacks RUN_WITH_GROUP) / 403 (N/A) / 404 (N/A — 400 used) / 429 / 500.

Stop every running scraper in a group — GET /api/scraper/group/{id}/stop

Stops every scraper in the group that’s currently RUNNING. Scrapers not in RUNNING state are silently skipped. Partial scraped data already written is kept — clear it via DELETE /api/scraper/data per scraperJobStatusId if needed.

curl -H "Authorization: Bearer $KEY" \
  -H "Idempotency-Key: $(uuidgen)" \
  "https://portal.scrapewise.ai/api/scraper-api/api/scraper/group/grp_xyz/stop"

Response: 200 OK (empty body). State transitions complete synchronously.

Errors — 400 (group not found) / 401 / 403 (N/A) / 404 (N/A — 400 used) / 429 / 500.

Fallback scraper — PUT /api/scraper/fallback

Attach a fallback scraper to a SINGLE_PRODUCT primary scraper. The fallback runs automatically when the primary extracts no data for a link (e.g. a JSON-LD parser as a backup to CSS-selector extraction). Requires the FALLBACK_SCRAPER plan feature.

PUT /api/scraper/fallback
Authorization: Bearer <key>
Idempotency-Key: <uuid>
Content-Type: application/json

{
  "scraperId": "scr_abc123",
  "fallback": { /* full ScraperDTO of the fallback scraper */ }
}

Response (200) — the persisted fallback ScraperDTO with its assigned id, or null on silent failure (rare; usually surfaces as a 4xx instead).

Errors — 400 (primary scraper isn’t SINGLE_PRODUCT type / scraperId doesn’t exist) / 401 / 402 (plan lacks FALLBACK_SCRAPER) / 403 (N/A) / 404 (N/A — 400 used) / 429 / 500.

Debug: load raw site HTML — GET /api/scraper/load-site + /load-site-content

Two thin debug fetchers that go through the proxy infrastructure + SSRF deny-list + customer rate limit. Use to inspect a page’s structure when designing a scraper config or debugging selectors.

# Raw text/html
curl -H "Authorization: Bearer $KEY" \
  "https://portal.scrapewise.ai/api/scraper-api/api/scraper/load-site?url=https://example.com&useCache=true"

# JSON envelope { "content": "<html>" }, for MCP-style tooling
curl -H "Authorization: Bearer $KEY" \
  "https://portal.scrapewise.ai/api/scraper-api/api/scraper/load-site-content?url=https://example.com"

| Param | Description |
| --- | --- |
| url | Required. Public HTTP(S). Internal / link-local / cloud-metadata hosts are SSRF-blocked. |
| cookiesAcceptSelector | Optional CSS selector for a cookie-banner accept button (rendered fetch only). |
| useCache | Optional; return cached HTML if available (1h TTL). |

Errors (both) — 400 (SSRF_BLOCKED) / 401 / 403 (N/A) / 404 (N/A) / 429 (rate limit) / 500. Also 502 if upstream unreachable.

Config parameter catalogue — GET /api/scraper/config/parameters

Returns the canonical enum lists used by scraper config builders: scraper types, pagination types, parameter types, post-process kinds, and the common-parameters catalogue. Used by the scraper-builder UI; usually not interesting to API consumers unless you’re building your own builder.

curl -H "Authorization: Bearer $KEY" \
  https://portal.scrapewise.ai/api/scraper-api/api/scraper/config/parameters

Response (200): ScraperConfigParametersDTO (scraperType, paginationType, parameterType, commonParameters, postProcessKind).

Errors — 400 (N/A) / 401 / 403 (N/A) / 404 (N/A) / 429 / 500.

Common errors

| Code | Meaning |
| --- | --- |
| scraper_not_found | No scraper with that id exists for your customer |
| scraper_already_running | A run was triggered while the previous job is still in flight (stop it first via GET /api/scraper/{id}/stop, or wait for the SSE stream to report SUCCESS) |
| schema_validation_failed | The schema you passed has invalid types or unsupported field names |
| start_urls_unreachable | None of the start URLs returned 2xx; check DNS / target availability |

See Errors for the full envelope format.

What’s next

  • Read what the scraper produced: Data
  • One-shot preview without saving a scraper: Preview
  • Use these tools from Claude: MCP quickstart