Scrapers
The full scraper CRUD + run + watch lifecycle. A scraper is a reusable extraction recipe; runs of a scraper produce jobs; jobs produce rows of data (see Data).
Create or update — PUT /api/scraper
Upsert semantics: omit id in the body to create a new scraper, set id to update an existing one. The same endpoint covers both — there is no separate PUT /api/scraper/{id} (try it and you get 405). The body is the full ScraperDTO; partial updates are not supported — always fetch the current state via GET /api/scraper/{id}, mutate, then PUT the whole thing back.
The easiest way to discover a valid ScraperDTO for a given site is to call the Preview endpoint first — it returns one or more candidate scraperDTO payloads you can copy into your PUT body.
PUT /api/scraper
Authorization: Bearer <key>
Idempotency-Key: <uuid>
Content-Type: application/json
{
"id": null,
"name": "competitor-prices",
"groupRef": "grp_xyz",
"type": "MULTIPLE_PRODUCTS",
"sourceConfig": {
"url": "https://competitor.com/products",
"pagination": { "type": "URL_PAGE_PARAM", "param": "page", "max": 50 }
},
"itemsConfig": [
{ "name": "title", "selector": "h2.product-title", "kind": "TEXT" },
{ "name": "price", "selector": ".price", "kind": "TEXT" },
{ "name": "in_stock", "selector": ".add-to-cart", "kind": "EXISTS" }
],
"schema": { "id": "sch_..." },
"postProcessRules": []
}

| Field | Required | Description |
|---|---|---|
| id | no | Omit to create; set to update an existing scraper (must be yours; a cross-tenant id forge is rejected) |
| name | yes | Human label, unique per customer |
| groupRef | yes | The id of the group this scraper belongs to. See Groups |
| type | yes | SINGLE_PRODUCT / MULTIPLE_PRODUCTS / APPLICATION_LD_JSON / AI_CONF / AMAZON / GOOGLE_SEARCH |
| sourceConfig | yes | Source descriptor (url + pagination, or curl). Get the right shape from Preview |
| itemsConfig | conditional | Field selectors (CSS / XPath / JSON-path depending on type). Not used by AI_CONF (which uses schema instead) |
| schema | for AI_CONF only | { id } reference to a customer schema. See Schemas |
| postProcessRules | no | List of PostProcessRule transformations applied to each scraped row (CURRENCY_CONVERT, ENUM_MAP, REGEX_CLEAN). Each rule reads one field and writes a derived column. See Preview a post-process rule for the per-kind DTO shape and behaviour. |
| schedule | no | Cron expression for scheduled runs (also set via PUT /api/scraper/group/{id}/start-type/{startType}) |
Response — the persisted ScraperDTO with the assigned id.
Errors — 400 (validation failed / cross-tenant id forge / schema_validation_failed) / 401 / 402 (plan MAX_SCRAPERS limit on create) / 403 (N/A) / 404 (N/A) / 429 / 500.
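Because partial updates are not supported, an update is effectively a read-modify-write cycle. The following is a minimal Python sketch of that cycle, assuming the base URL from the curl examples on this page. The helper names `with_changes` and `build_put` are hypothetical, and the request is only built, not sent.

```python
import json
import uuid
import urllib.request

# Base URL taken from the curl examples on this page.
BASE = "https://portal.scrapewise.ai/api/scraper-api"

def with_changes(current: dict, changes: dict) -> dict:
    """Return a full ScraperDTO: the fetched state plus top-level changes."""
    merged = dict(current)
    merged.update(changes)
    return merged

def build_put(dto: dict, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the PUT /api/scraper upsert request."""
    return urllib.request.Request(
        f"{BASE}/api/scraper",
        data=json.dumps(dto).encode("utf-8"),
        method="PUT",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Idempotency-Key": str(uuid.uuid4()),
            "Content-Type": "application/json",
        },
    )

# `current` stands in for the body returned by GET /api/scraper/{id}.
current = {"id": "scr_abc123", "name": "competitor-prices", "postProcessRules": []}
req = build_put(with_changes(current, {"name": "competitor-prices-eu"}), "test-key")
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it. The point of the sketch is that the body always carries every field of the fetched DTO, including `id`, not just the changed one.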
List — GET /api/scraper/list
curl -H "Authorization: Bearer $KEY" \
"https://portal.scrapewise.ai/api/scraper-api/api/scraper/list?limit=50"

Paginated (see REST overview, pagination).

Filters:

| Param | Effect |
|---|---|
| q | Free-text on name |
| groupId | Only scrapers in this group |
| hasRunWithin | Only scrapers run in the last N days, e.g. 7d |
Errors — 400 (N/A) / 401 / 403 (N/A) / 404 (N/A — empty list returned if none) / 429 / 500.
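As an illustration of the filters above, the sketch below builds a list URL in Python. It assumes the filters can be combined in one query string, which this page does not state explicitly; `list_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Base URL taken from the curl examples on this page.
BASE = "https://portal.scrapewise.ai/api/scraper-api"

def list_url(limit=50, q=None, group_id=None, has_run_within=None):
    """Build the GET /api/scraper/list URL, skipping filters that are unset."""
    params = {
        "limit": limit,
        "q": q,
        "groupId": group_id,
        "hasRunWithin": has_run_within,
    }
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{BASE}/api/scraper/list?{query}"

print(list_url(q="competitor", group_id="grp_xyz", has_run_within="7d"))
```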
Get — GET /api/scraper/{id}
Single scraper by id. Returns the full ScraperDTO definition (sourceConfig, itemsConfig, postProcessRules, schema reference, etc.).
Errors — 400 (N/A — invalid id format surfaces as 404 envelope) / 401 / 403 (N/A) / 404 (scraper_not_found) / 429 / 500.
Updating an existing scraper: there is no separate PUT /api/scraper/{id} endpoint. Update via PUT /api/scraper with id set in the body, the same endpoint that handles create.
V2: enriched create/update + get — PUT /api/scraper/v2 + GET /api/scraper/v2/{id}
V2 endpoints have identical semantics to their V1 counterparts (PUT for upsert, GET for read by id) but return the richer ScraperSuperDTO shape instead of ScraperDTO: configValid flag, full inlined schema body, full inlined group body, and rolled-up runStats from the most-recent N runs.
Use V2 when you need the full picture in one round-trip instead of composing V1-get + a separate group fetch + a separate schema fetch. V2 is the preferred shape for new tooling — V1 is kept for portal back-compat.
PUT /api/scraper/v2
Authorization: Bearer <key>
Idempotency-Key: <uuid>
Content-Type: application/json
{ ... same ScraperDTO body as V1 ... }

curl -H "Authorization: Bearer $KEY" \
https://portal.scrapewise.ai/api/scraper-api/api/scraper/v2/scr_abc123

Response (200): ScraperSuperDTO (strict superset of ScraperDTO):
{
"id": "scr_abc123",
"name": "competitor-prices",
"configValid": true,
"schema": { "id": "sch_...", "content": { /* full schema document */ } },
"group": { "id": "grp_...", "name": "competitors", "dataTable": "competitor_prices" },
"runStats": { "lastNRuns": 5, "successes": 5, "failures": 0, "avgRows": 184 },
"...": "all V1 fields preserved"
}

Errors (both PUT v2 and GET v2): same as the V1 equivalents: 400 (validation / cross-tenant id forge) / 401 / 402 (MAX_SCRAPERS on create) / 403 (N/A) / 404 (scraper_not_found on GET) / 429 / 500.
Delete a scraper (destructive — two-call protocol) — DELETE /api/scraper/{id}
Destructive operation. Deleting a scraper is permanent. By default the rows the scraper produced are preserved (still accessible via GET /api/scraper/data?scraperId=...); pass withData=true to also delete those rows.
ADR-012 two-call pattern: first preview, then commit within 5 minutes.
Steps
1. POST /api/scraper/{id}/preview-delete[?withData=true]: mints a 5-minute DestructiveOpToken plus a preview summary listing what will be removed (scraper config + groups + optional data cascade).
2. DELETE /api/scraper/{id}[?withData=true]: commits the deletion (Idempotency-Key required).
Skipping the preview step commits the deletion (and, with withData=true, the removal of the scraped rows) without any confirmation.
Step 1 — Preview
POST /api/scraper/scr_abc123/preview-delete?withData=true
Authorization: Bearer <key>

Response: DestructivePreviewResponseDTO (token, opName, targetEntityId, previewSummary with cascade counts + warnings). Render it to the user before committing.
Step 2 — Commit
DELETE /api/scraper/scr_abc123?withData=true
Authorization: Bearer <key>
Idempotency-Key: <uuid>

withData=false (default) preserves the scraped rows; withData=true also removes them from the group’s data collection (irreversible).
Response — 204 No Content.
Errors (both steps)
| Code | Meaning |
|---|---|
| 400 | Scraper doesn’t exist for this customer |
| 401 | Missing/invalid bearer |
| 403 | N/A |
| 404 | N/A (400 used) |
| 429 | Rate-limited |
| 500 | Persistence failure mid-delete |
Deleting only the data, not the scraper: use
DELETE /api/scraper/data?scraperJobStatusId=..., which also follows the two-call protocol.
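The preview-then-commit steps for deleting a scraper can be sketched in Python as below. This is an illustration under stated assumptions: `send` is a hypothetical transport callable, the preview is assumed to answer 200, and since this page does not show how the DestructiveOpToken is presented on the commit call, the sketch does not model that part.

```python
import uuid

def delete_scraper(send, scraper_id, with_data=False, confirm=lambda preview: False):
    """Preview, ask for confirmation, then commit.

    `send(method, path, headers)` is a hypothetical transport that returns
    (status_code, parsed_body). Returns True only if the commit answered 204.
    """
    suffix = "?withData=true" if with_data else ""
    status, preview = send("POST", f"/api/scraper/{scraper_id}/preview-delete{suffix}", {})
    if status != 200:  # assumption: a successful preview answers 200
        raise RuntimeError(f"preview-delete failed with HTTP {status}")
    if not confirm(preview):
        return False  # user declined; nothing is committed
    headers = {"Idempotency-Key": str(uuid.uuid4())}
    status, _ = send("DELETE", f"/api/scraper/{scraper_id}{suffix}", headers)
    return status == 204

# Fake transport so the sketch runs offline.
calls = []
def fake_send(method, path, headers):
    calls.append((method, path))
    if method == "POST":
        return 200, {"token": "tok_example", "previewSummary": {"warnings": []}}
    return 204, None

deleted = delete_scraper(fake_send, "scr_abc123", with_data=True, confirm=lambda p: True)
print(deleted, calls)
```

The `confirm` callback is where a client would render the preview summary to the user; it defaults to declining so that a commit never happens by accident.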
Auto-refresh values — keeping a cURL alive across the site’s deploys
Some sites rotate pieces of their URLs or headers on every deploy — Next.js sites change buildId in the _next/data/<id>/… path, Vercel rotates x-deployment-id, anti-bot middleware mints fresh tokens. A cURL captured from DevTools on Tuesday breaks on Wednesday’s deploy, and you’d otherwise have to re-paste a fresh cURL every time.
The platform solves this in two ways. The first needs no setup; the second is configured per scraper:
1. Zero-config Next.js auto-recovery (no setup needed)
When an API/CURL scraper returns 404 AND the URL matches /_next/data/<id>/, the engine automatically:
- Fetches the host root (e.g. https://www.ermitazas.lt/)
- Reads <script id="__NEXT_DATA__"> for the fresh buildId
- Reads <meta name="next-deployment-id"> for the fresh deployment id (if present)
- Rewrites the cURL with the fresh values and retries once
Works for ermitazas.lt (Vercel), karkkainen.com (Next.js on CloudFront), and any standard Next.js site without any per-scraper config. If discovery fails or returns the same tokens (genuine 404), the original error propagates unchanged.
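The discovery steps above can be approximated in a few lines of Python. This is a rough sketch rather than the engine's actual code: the function names are hypothetical, the regular expressions assume the attribute order shown above, and only the URL path is rewritten (the deployment-id header is left out).

```python
import json
import re

def discover_next_tokens(html: str):
    """Return (build_id, deployment_id) found in a Next.js page, or None for each."""
    build_id = deployment_id = None
    m = re.search(r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', html, re.S)
    if m:
        build_id = json.loads(m.group(1)).get("buildId")
    m = re.search(r'<meta name="next-deployment-id" content="([^"]*)"', html)
    if m:
        deployment_id = m.group(1)
    return build_id, deployment_id

def rewrite_build_id(url: str, fresh_build_id: str) -> str:
    """Swap the <id> segment of /_next/data/<id>/ for the fresh buildId."""
    return re.sub(
        r"(/_next/data/)[^/]+(/)",
        lambda m: m.group(1) + fresh_build_id + m.group(2),
        url,
        count=1,
    )

def recover(url: str, html: str):
    """Return a rewritten URL, or None when there is nothing new to retry with."""
    build_id, _ = discover_next_tokens(html)
    fresh = rewrite_build_id(url, build_id) if build_id else None
    return fresh if fresh and fresh != url else None

html = ('<html><head><meta name="next-deployment-id" content="dpl_new"></head>'
        '<body><script id="__NEXT_DATA__" type="application/json">'
        '{"buildId": "build_new", "page": "/"}</script></body></html>')
stale = "https://www.ermitazas.lt/_next/data/build_old/lt/search.json?q=home4you"
print(recover(stale, html))
```

Returning `None` when the rewritten URL equals the original mirrors the "same tokens means genuine 404" rule: there is no point retrying with an identical request.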
2. Explicit sourceConfig.prelaunch (for custom rotating tokens)
For sites with non-standard rotating tokens (anti-bot middleware, JWT-shaped nonces, custom markers), declare extraction rules on sourceConfig.prelaunch. The cURL template uses {{NAME}} placeholders that get substituted with fresh values before each fetch.
{
"sourceConfig": {
"curl": "curl 'https://www.ermitazas.lt/_next/data/{{BUILD_ID}}/lt/search.json?q=home4you' -H 'x-deployment-id: {{DEPLOYMENT_ID}}'",
"prelaunch": {
"discoveryUrl": "https://www.ermitazas.lt/",
"ttlMinutes": 360,
"tokens": [
{ "name": "BUILD_ID", "source": "NEXT_DATA", "expression": "/buildId" },
{ "name": "DEPLOYMENT_ID", "source": "META", "expression": "next-deployment-id" }
]
}
}
}

| sourceConfig.prelaunch field | Type | Description |
|---|---|---|
| discoveryUrl | string | Public page to fetch once per TTL to extract fresh values. Usually the site’s home page. |
| ttlMinutes | int (1–43200) | How long a successful resolve is cached. Default 360 (6 hours). On 404 from the proxy, the cache is invalidated and the values are refetched. |
| tokens | TokenRule[] | One rule per placeholder. Empty list = feature disabled. |

| TokenRule field | Type | Description |
|---|---|---|
| name | string, ^[A-Za-z0-9_]+$ | Placeholder name. Referenced as {{NAME}} in the cURL; case-sensitive. |
| source | enum | NEXT_DATA (RFC 6901 JSON pointer into the page’s <script id="__NEXT_DATA__"> blob), META (value of a <meta> tag by its name attribute), or REGEX (first capture group of a regex run against the raw HTML). |
| expression | string | Per-source extractor: /buildId for NEXT_DATA, next-deployment-id for META, "buildId":"([^"]+)" for REGEX. |
| validatePattern | string (optional) | Regex the resolved value must match. Default ^[A-Za-z0-9._=+/~-]{1,4096}$. Accepts dpl_…, base64-padded, and JWT-shaped values; rejects URL-reserved characters and injection vectors (&, ?, #, space, CRLF, quotes). |
Portal users: the same configuration is editable in the create/edit scraper form under “Auto-refresh values” when the scraper is in API/CURL mode. Leave the rules list empty to disable.
Behaviour notes:
- Self-healing on rotation: if the substituted cURL returns 404/410/451, the engine invalidates the cached tokens, rediscovers, and retries once. If the fresh tokens are byte-identical to the stale ones (i.e. a genuine product 404), the retry is short-circuited.
- Cache + concurrency: resolved tokens are cached per (scraperId, configHash) per pod. A 500-URL batch hitting the same scraper triggers exactly one discovery fetch; concurrent callers coalesce via a per-scraper mutex.
- Cross-scraper throttle: discovery requests share a 3-permit semaphore so a cold-start storm (e.g. 50 scrapers waking up after a pod restart) doesn’t saturate the upstream proxy.
- Failure modes: if discovery fails (network, missing markers, regex no-match, value fails validatePattern), the scrape fails with a typed prelaunch_refresh_failed exception; the system never silently sends a literal {{BUILD_ID}} to the upstream.
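To make the token rules concrete, here is a minimal Python sketch of resolving each rule against the discovery page and substituting the results into the cURL template. It is an approximation under assumptions, not the engine's code: `resolve`, `substitute` and the `PrelaunchRefreshFailed` class are hypothetical names, and the META lookup assumes the `name` attribute precedes `content`.

```python
import json
import re

DEFAULT_VALIDATE = r"^[A-Za-z0-9._=+/~-]{1,4096}$"

class PrelaunchRefreshFailed(Exception):
    """Stand-in for the prelaunch_refresh_failed error named above."""

def json_pointer(doc, pointer: str):
    """Minimal RFC 6901 lookup (handles ~0/~1 escapes and list indices)."""
    for raw in pointer.split("/")[1:]:
        key = raw.replace("~1", "/").replace("~0", "~")
        doc = doc[int(key)] if isinstance(doc, list) else doc[key]
    return doc

def resolve(rule: dict, html: str) -> str:
    """Extract one token value from the page, or raise PrelaunchRefreshFailed."""
    try:
        if rule["source"] == "NEXT_DATA":
            blob = re.search(r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', html, re.S).group(1)
            value = json_pointer(json.loads(blob), rule["expression"])
        elif rule["source"] == "META":
            tag = r'<meta name="%s" content="([^"]*)"' % re.escape(rule["expression"])
            value = re.search(tag, html).group(1)
        else:  # REGEX: first capture group against the raw HTML
            value = re.search(rule["expression"], html).group(1)
    except (AttributeError, KeyError, IndexError, TypeError, ValueError) as exc:
        raise PrelaunchRefreshFailed(rule["name"]) from exc
    pattern = rule.get("validatePattern", DEFAULT_VALIDATE)
    if not isinstance(value, str) or not re.fullmatch(pattern, value):
        raise PrelaunchRefreshFailed(rule["name"])
    return value

def substitute(curl_template: str, rules: list, html: str) -> str:
    """Replace every {{NAME}} placeholder with its freshly resolved value."""
    for rule in rules:
        curl_template = curl_template.replace("{{%s}}" % rule["name"], resolve(rule, html))
    leftover = re.search(r"\{\{[A-Za-z0-9_]+\}\}", curl_template)
    if leftover:  # never send a literal placeholder upstream
        raise PrelaunchRefreshFailed(leftover.group(0))
    return curl_template

html = ('<meta name="next-deployment-id" content="dpl_42">'
        '<script id="__NEXT_DATA__" type="application/json">{"buildId": "b9f3"}</script>')
rules = [
    {"name": "BUILD_ID", "source": "NEXT_DATA", "expression": "/buildId"},
    {"name": "DEPLOYMENT_ID", "source": "META", "expression": "next-deployment-id"},
]
template = ("curl 'https://www.ermitazas.lt/_next/data/{{BUILD_ID}}/lt/search.json' "
            "-H 'x-deployment-id: {{DEPLOYMENT_ID}}'")
print(substitute(template, rules, html))
```

Caching, the per-scraper mutex and the discovery semaphore are deliberately left out; the sketch only covers extraction, validation and the fail-closed behaviour for unresolved placeholders.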
Trigger a run — GET /api/scraper/{id}/run
Yes, GET for run is intentional — runs are idempotent at the Idempotency-Key level (see below). Triggers an async job.
curl -H "Authorization: Bearer $KEY" \
"https://portal.scrapewise.ai/api/scraper-api/api/scraper/$SCRAPER_ID/run"

Response:
{
"jobId": "job_xyz789",
"scraperId": "scr_abc123",
"status": "PENDING",
"queuedAt": "2026-05-19T12:30:00Z"
}

The job runs asynchronously. To watch progress, use the SSE stream (next section).
Idempotency for runs
Pass an Idempotency-Key header to make retries safe:
curl -H "Authorization: Bearer $KEY" \
-H "Idempotency-Key: 550e8400-e29b-41d4-a716-446655440000" \
".../api/scraper/$SCRAPER_ID/run"

If you retry with the same idempotency key within 24h, Scrapewise returns the original job (no double-run).
Errors — 400 (scraper_already_running / start_urls_unreachable) / 401 / 403 (N/A) / 404 (scraper_not_found) / 429 / 500.
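The retry guarantee only holds if the client reuses the same key for every attempt of one logical run. A small Python sketch of that pattern follows; `send` is a hypothetical transport callable and the retry policy (three attempts, retry on `ConnectionError` only) is an assumption made for illustration.

```python
import uuid

def run_with_retry(send, scraper_id: str, attempts: int = 3):
    """Trigger a run, reusing one Idempotency-Key across retries.

    `send(path, headers)` is a hypothetical transport that returns the parsed
    JSON body or raises ConnectionError on a network failure.
    """
    key = str(uuid.uuid4())  # one key per logical run, not per attempt
    last_error = None
    for _ in range(attempts):
        try:
            return send(f"/api/scraper/{scraper_id}/run", {"Idempotency-Key": key})
        except ConnectionError as exc:
            last_error = exc
    raise last_error

# Fake transport: the first attempt drops, the second succeeds.
seen_keys = []
def flaky_send(path, headers):
    seen_keys.append(headers["Idempotency-Key"])
    if len(seen_keys) == 1:
        raise ConnectionError("simulated timeout")
    return {"jobId": "job_xyz789", "status": "PENDING"}

job = run_with_retry(flaky_send, "scr_abc123")
print(job["jobId"], len(set(seen_keys)))
```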
Stream job status — GET /api/scraper-job-status/{id}/stream
Server-Sent Events stream of an individual job’s progress. The {id} is a scraper job status id (ScraperJobStatusDTO.id), not the scraper id — obtain it from GET /api/scraper/load-history filtered by scraperId.
# 1) Find the latest job for this scraper
JOB_ID=$(curl -s -H "Authorization: Bearer $KEY" \
"https://portal.scrapewise.ai/api/scraper-api/api/scraper/load-history?scraperId=$SCRAPER_ID&size=1" \
| jq -r '.content[0].id')
# 2) Stream its progress
curl -N -H "Authorization: Bearer $KEY" \
"https://portal.scrapewise.ai/api/scraper-api/api/scraper-job-status/$JOB_ID/stream"

The -N flag disables curl’s buffering so events appear as they arrive.
You’ll see events like:
event: status
data: {"state":"RUNNING","itemsQuantity":0}
event: status
data: {"state":"RUNNING","itemsQuantity":42}
event: status
data: {"state":"RUNNING","itemsQuantity":140}
event: status
data: {"state":"SUCCESS","itemsQuantity":175,"duration":47213}

Field names follow ScraperJobStatusDTO: state, itemsQuantity, totalRequests, duration, errorMessage. The stream closes when the job reaches a terminal state (SUCCESS / FAILED / CANCELLED).
Node.js SSE example
import { fetchEventSource } from '@microsoft/fetch-event-source';
// jobStatusId obtained from /api/scraper/load-history beforehand
await fetchEventSource(
`https://portal.scrapewise.ai/api/scraper-api/api/scraper-job-status/${jobStatusId}/stream`,
{
headers: { Authorization: `Bearer ${process.env.SCRAPEWISE_KEY}` },
onmessage(ev) {
const data = JSON.parse(ev.data);
console.log(`${data.state} items=${data.itemsQuantity ?? '-'}`);
},
}
);

Python SSE example
import json
import os

import requests

# job_status_id obtained from /api/scraper/load-history beforehand
with requests.get(
    f'https://portal.scrapewise.ai/api/scraper-api/api/scraper-job-status/{job_status_id}/stream',
    headers={'Authorization': f"Bearer {os.environ['SCRAPEWISE_KEY']}"},
    stream=True,  # keep the connection open and read events as they arrive
) as r:
    r.raise_for_status()
    for line in r.iter_lines():
        # SSE payload lines start with "data: "; skip event names and keep-alives
        if line and line.startswith(b'data: '):
            event = json.loads(line[6:])
            print(event['state'], event.get('itemsQuantity'))

Errors: 400 (N/A) / 401 / 403 (N/A) / 404 (N/A; an invalid {id} surfaces as a 401 envelope) / 429 (N/A) / 500. The SSE connection closes with the terminal event when the job ends; long-lived 5xx errors break the connection (reconnect from your client).
Stop a running scraper — GET /api/scraper/{id}/stop
Stops the scraper if it’s currently in RUNNING state. Scrapers in any other state are silently skipped (the inner if-check short-circuits the save). Partial scraped data already written to the group’s data collection is kept — clear it with DELETE /api/scraper/data?scraperJobStatusId=... if needed.
GET (not POST/DELETE) is intentional: the operation is fully idempotent at the underlying state machine level and the Idempotency-Key header is required.
curl -H "Authorization: Bearer $KEY" \
-H "Idempotency-Key: $(uuidgen)" \
"https://portal.scrapewise.ai/api/scraper-api/api/scraper/$SCRAPER_ID/stop"

Response: 200 OK (empty body). State transitions complete synchronously before the response is returned.
Errors — 400 (scraper_not_found) / 401 / 403 (N/A) / 404 (N/A — 400 envelope used) / 429 / 500.
To stop ALL running scrapers in a group at once, see Stop every running scraper in a group.
Sample data — GET /api/scraper/{id}/get-sample-data
Runs the scraper synchronously against its source URLs (capped to a FIRST batch / FIRST page — NOT a full run, no pagination expansion) and returns the parsed rows immediately. Use to validate a scraper config visually before scheduling a full async run.
curl -H "Authorization: Bearer $KEY" \
"https://portal.scrapewise.ai/api/scraper-api/api/scraper/$SCRAPER_ID/get-sample-data?useCache=true"

| Param | Default | Description |
|---|---|---|
| useCache | false | Return the cached sample within the 1h TTL; faster, possibly stale. |
Response (200) — List<Map<String, Any>> keyed by the scraper’s configured field names. Values are post-processed if rules are set.
Errors — 400 (scraper_not_found) / 401 / 403 (N/A) / 404 (N/A — 400 used instead) / 429 / 500. Also 502 if the source site is unreachable.
Preview a post-process rule — POST /api/scraper/preview-rule
Apply a single PostProcessRule to a user-provided sample value and return the derived output without persisting anything or running a real scrape. Use to validate rule parameters before saving the rule onto a scraper config.
Request body shape
| Field | Type | Description |
|---|---|---|
| rule.kind | enum | CURRENCY_CONVERT / ENUM_MAP / REGEX_CLEAN |
| rule.sourceField | string | Name of the scraped field to read from |
| rule.outputField | string | Name of the derived column to write to (must differ from sourceField and not collide with reserved system fields like date, group, scraperId) |
| rule.outputType | enum | NUMBER / TEXT / BOOLEAN. Declarative type hint for downstream consumers. The executor does NOT coerce values to this type; whatever the rule’s params produce is what’s written. |
| rule.params | object | Per-kind shape, see below |
| sampleValue | any | The value to test the rule against |
Response (200): { "output": <derived value> } (plus rate + rateDate for CURRENCY_CONVERT).
Per-kind reference
Each rule kind has its own params shape and its own null-source behaviour. The three kinds differ deliberately — see the per-kind note below.
CURRENCY_CONVERT — convert one currency to another using ECB rates
| Param | Required | Description |
|---|---|---|
| from | yes | 3-letter ISO code (SEK, EUR, USD, …) |
| to | yes | 3-letter ISO code |
Null source: no-op — the output field is not written.
Side outputs: writes {outputField}_rate and {outputField}_rateDate alongside the converted value.
POST /api/scraper/preview-rule
Authorization: Bearer <key>
Content-Type: application/json
{
"rule": {
"kind": "CURRENCY_CONVERT",
"sourceField": "priceSek",
"outputField": "priceEur",
"outputType": "NUMBER",
"params": { "from": "SEK", "to": "EUR" }
},
"sampleValue": 2490
}

Response: { "output": 215.85, "rate": 0.0867, "rateDate": "2026-04-17" }.
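At row level, the null-source rule and the side outputs of CURRENCY_CONVERT could look like the Python sketch below. It is only an illustration: the rate is passed in by the caller instead of being looked up from ECB data, the 0.5 rate is made up, and rounding is not modelled because this page does not specify it.

```python
def currency_convert(row, source_field, output_field, rate, rate_date):
    """Return a copy of `row` with the converted value and its side outputs."""
    out = dict(row)
    value = row.get(source_field)
    if value is None:
        return out  # null source: no-op, output field is not written
    out[output_field] = value * rate
    out[f"{output_field}_rate"] = rate
    out[f"{output_field}_rateDate"] = rate_date
    return out

# Illustrative rate only, not a real ECB rate.
converted = currency_convert({"priceSek": 200}, "priceSek", "priceEur", 0.5, "2026-04-17")
skipped = currency_convert({"priceSek": None}, "priceSek", "priceEur", 0.5, "2026-04-17")
print(converted)
print(skipped)
```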
ENUM_MAP — look up the source value in a mapping table
| Param | Required | Description |
|---|---|---|
| mapping | yes (non-empty) | Map<string, any>: key is the source value (stringified), value is what gets written to the output column |
| default | no | What to write when no key matches (or when the matched value is null) |
Null source — the null-sentinel: a null source value looks up the empty-string "" key. The same applies to an empty-string source ("") — null and "" are intentionally treated as the same “no meaningful value” for ENUM_MAP lookup purposes. This is the canonical pattern for deriving a True/False column based on whether a scraped field is present or absent on the page.
Null mapped value: if a key maps to null (e.g. {"": null}), that’s treated as “no real mapping” and falls through to default. Closes the silent-strip footgun where a null output value would be removed from the row by the downstream null-filter.
POST /api/scraper/preview-rule
Authorization: Bearer <key>
Content-Type: application/json
{
"rule": {
"kind": "ENUM_MAP",
"sourceField": "promotionalBanner",
"outputField": "isPromoted",
"outputType": "TEXT",
"params": {
"mapping": {
"": "False",
"Sale": "True",
"New": "True"
},
"default": "Unknown"
}
},
"sampleValue": null
}

Response: { "output": "False" }. The empty-string mapping key matched the null source.
Worked behaviour table for the rule above:
| Scraped promotionalBanner | Matches | Written to isPromoted |
|---|---|---|
| null / field absent from page | "" row | "False" |
| "" (empty string scraped) | "" row | "False" |
| "Sale" | Sale row | "True" |
| "New" | New row | "True" |
| "Mystery" | (no row matches) | "Unknown" (default) |
The same null-sentinel works in the portal’s rule builder: leave the “When scraped value is” cell blank and type the desired output in “Save as”. The empty-key row shows a (matches missing / null) placeholder so it’s visually distinguishable from a half-finished add.
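The ENUM_MAP lookup described above fits in a few lines. The Python sketch below is an interpretation of those rules, not the executor itself; in particular, `str(value)` is an assumption about how non-string source values are stringified.

```python
def enum_map(value, mapping, default=None):
    """Look up `value` in `mapping`, treating null and "" as the same key."""
    key = "" if value is None else str(value)    # null-sentinel: None uses the "" key
    mapped = mapping.get(key)
    return default if mapped is None else mapped  # a null mapping falls through to default

mapping = {"": "False", "Sale": "True", "New": "True"}
for scraped in (None, "", "Sale", "New", "Mystery"):
    print(repr(scraped), "->", enum_map(scraped, mapping, default="Unknown"))
```

Running it reproduces the worked table: the first two inputs yield "False", "Sale" and "New" yield "True", and "Mystery" falls back to "Unknown".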
REGEX_CLEAN — keep only characters matching a safe character class
| Param | Required | Description |
|---|---|---|
| keep | no | Character class to keep, e.g. [0-9], [A-Za-z0-9], or a bare literal like 0-9 (auto-wrapped in […]). Default [A-Za-z0-9]. |
Null source: no-op — the output field is not written.
Safety: the keep pattern is restricted to simple character classes — groups (), quantifiers *+?, alternation |, and curly braces {} are all rejected to prevent catastrophic backtracking. Unsafe pattern → 200 { "error": "Unsafe REGEX_CLEAN 'keep' pattern" }.
POST /api/scraper/preview-rule
Authorization: Bearer <key>
Content-Type: application/json
{
"rule": {
"kind": "REGEX_CLEAN",
"sourceField": "price",
"outputField": "priceClean",
"outputType": "TEXT",
"params": { "keep": "[0-9.]" }
},
"sampleValue": "$1,234.56"
}

Response: { "output": "1234.56" }.
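For intuition, REGEX_CLEAN can be approximated in Python as follows. This is a sketch under assumptions: the real validator may be stricter than the character blacklist used here, and raising `ValueError` stands in for the 200 response with an error body.

```python
import re

UNSAFE = set("()*+?|{}")  # groups, quantifiers, alternation, curly braces

def regex_clean(value, keep="[A-Za-z0-9]"):
    """Keep only the characters of `value` that match the `keep` class."""
    if value is None:
        return None  # null source: caller skips writing the output field
    if any(ch in UNSAFE for ch in keep):
        raise ValueError("Unsafe REGEX_CLEAN 'keep' pattern")
    if not (keep.startswith("[") and keep.endswith("]")):
        keep = f"[{keep}]"  # bare literal like 0-9 is auto-wrapped
    return "".join(re.findall(keep, str(value)))

print(regex_clean("$1,234.56", "[0-9.]"))
print(regex_clean("a1-b2", "0-9"))
```

Because the pattern is a single character class, matching is linear in the input length, which is the property the safety restriction is meant to guarantee.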
Cross-rule null-source posture
| Rule kind | Null source behaviour |
|---|---|
| CURRENCY_CONVERT | No-op: output field not written |
| ENUM_MAP | Looks up the "" key (null-sentinel); produces default if "" is not in the mapping |
| REGEX_CLEAN | No-op: output field not written |
ENUM_MAP is the only LOOKUP rule of the three (the others are transforms), which is why it’s the only one with null-sentinel semantics. If you need True/False from null/non-null, ENUM_MAP is the rule.
Errors
On invalid rule params, the endpoint returns 200 with an error string instead of throwing — intentional non-throw shape so agents can iterate on rule params without exception handling.
400 (request body validation) / 401 / 403 (N/A) / 404 (N/A) / 429 / 500. Rule-execution failures surface as 200 { "error": "..." } in the body, NOT as an HTTP error.
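Since rule failures arrive as HTTP 200, a client has to inspect the body rather than the status code. A minimal Python sketch of that check, assuming the `output` / `error` body shapes described above (`read_preview_result` is a hypothetical helper):

```python
def read_preview_result(status, body):
    """Return (ok, payload) for a preview-rule response."""
    if status != 200:
        raise RuntimeError(f"preview-rule failed with HTTP {status}")
    if "error" in body:
        return False, body["error"]  # rule-level failure, still HTTP 200
    return True, body["output"]

print(read_preview_result(200, {"output": "1234.56"}))
print(read_preview_result(200, {"error": "Unsafe REGEX_CLEAN 'keep' pattern"}))
```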
SEO fields
Get SEO fields for a scraper — GET /api/scraper/{id}/seo-fields
Returns SEO metadata (title, description, og:* tags) extracted from the scraper’s configured source URL. Cached at the SEO field service layer.
curl -H "Authorization: Bearer $KEY" \
https://portal.scrapewise.ai/api/scraper-api/api/scraper/$SCRAPER_ID/seo-fields

Response (200): List<SeoField> (name + content per discovered meta tag). Empty list if no extractable meta tags.
Errors — 400 (scraper_not_found) / 401 / 403 (N/A) / 404 (N/A — 400 used) / 429 / 500.
Get SEO fields for an arbitrary URL — POST /api/scraper/seo-fields
Same extraction logic but takes a URL in the body — no pre-existing scraper required. Subject to the SSRF deny-list.
POST /api/scraper/seo-fields
Authorization: Bearer <key>
Content-Type: application/json
{ "url": "https://example.com" }

Response (200): List<SeoField>.
Errors — 400 (SSRF_BLOCKED for internal / link-local / cloud-metadata hosts) / 401 / 403 (N/A) / 404 (N/A) / 429 / 500. Also 502 if upstream unreachable.
Group-level run + stop
Run every scraper in a group — GET /api/scraper/group/{id}/run
Triggers a manual run of EVERY scraper that belongs to the group. Each run is queued asynchronously. Requires the RUN_WITH_GROUP plan feature.
curl -H "Authorization: Bearer $KEY" \
-H "Idempotency-Key: $(uuidgen)" \
"https://portal.scrapewise.ai/api/scraper-api/api/scraper/group/grp_xyz/run?mergeData=true"

| Param | Default | Description |
|---|---|---|
| mergeData | false | After all per-scraper runs complete, merge results into the group’s enriched-sibling collection. |
Response — 204 No Content. Poll GET /api/scraper/load-history per scraper to check status.
Errors — 400 (group not found) / 401 / 402 (plan lacks RUN_WITH_GROUP) / 403 (N/A) / 404 (N/A — 400 used) / 429 / 500.
Stop every running scraper in a group — GET /api/scraper/group/{id}/stop
Stops every scraper in the group that’s currently RUNNING. Scrapers not in RUNNING state are silently skipped. Partial scraped data already written is kept — clear it via DELETE /api/scraper/data per scraperJobStatusId if needed.
curl -H "Authorization: Bearer $KEY" \
-H "Idempotency-Key: $(uuidgen)" \
"https://portal.scrapewise.ai/api/scraper-api/api/scraper/group/grp_xyz/stop"

Response: 200 OK (empty body). State transitions complete synchronously.
Errors — 400 (group not found) / 401 / 403 (N/A) / 404 (N/A — 400 used) / 429 / 500.
Fallback scraper — PUT /api/scraper/fallback
Attach a fallback scraper to a SINGLE_PRODUCT primary scraper. The fallback runs automatically when the primary extracts no data for a link (e.g. a JSON-LD parser as a backup to CSS-selector extraction). Requires the FALLBACK_SCRAPER plan feature.
PUT /api/scraper/fallback
Authorization: Bearer <key>
Idempotency-Key: <uuid>
Content-Type: application/json
{
"scraperId": "scr_abc123",
"fallback": { /* full ScraperDTO of the fallback scraper */ }
}

Response (200): the persisted fallback ScraperDTO with its assigned id, or null on silent failure (rare; usually surfaces as a 4xx instead).
Errors — 400 (primary scraper isn’t SINGLE_PRODUCT type / scraperId doesn’t exist) / 401 / 402 (plan lacks FALLBACK_SCRAPER) / 403 (N/A) / 404 (N/A — 400 used) / 429 / 500.
Debug: load raw site HTML — GET /api/scraper/load-site + /load-site-content
Two thin debug fetchers that go through the proxy infrastructure + SSRF deny-list + customer rate limit. Use to inspect a page’s structure when designing a scraper config or debugging selectors.
# Raw text/html
curl -H "Authorization: Bearer $KEY" \
"https://portal.scrapewise.ai/api/scraper-api/api/scraper/load-site?url=https://example.com&useCache=true"
# JSON envelope { "content": "<html>" } — for MCP-style tooling
curl -H "Authorization: Bearer $KEY" \
"https://portal.scrapewise.ai/api/scraper-api/api/scraper/load-site-content?url=https://example.com"

| Param | Description |
|---|---|
| url | Required. Public HTTP(S). Internal / link-local / cloud-metadata hosts are SSRF-blocked. |
| cookiesAcceptSelector | Optional CSS selector for a cookie-banner accept button (rendered fetch only). |
| useCache | Optional; return cached HTML if available (1h TTL). |
Errors (both) — 400 (SSRF_BLOCKED) / 401 / 403 (N/A) / 404 (N/A) / 429 (rate limit) / 500. Also 502 if upstream unreachable.
Config parameter catalogue — GET /api/scraper/config/parameters
Returns the canonical enum lists used by scraper config builders: scraper types, pagination types, parameter types, post-process kinds, and the common-parameters catalogue. Used by the scraper-builder UI; usually not interesting to API consumers unless you’re building your own builder.
curl -H "Authorization: Bearer $KEY" \
https://portal.scrapewise.ai/api/scraper-api/api/scraper/config/parameters

Response (200): ScraperConfigParametersDTO (scraperType, paginationType, parameterType, commonParameters, postProcessKind).
Errors — 400 (N/A) / 401 / 403 (N/A) / 404 (N/A) / 429 / 500.
Common errors
| Code | Meaning |
|---|---|
| scraper_not_found | No scraper with that id exists for your customer |
| scraper_already_running | Trying to trigger a run while the previous job is still in flight (stop it first via GET /api/scraper/{id}/stop, or wait for the SSE stream to report SUCCESS) |
| schema_validation_failed | The schema you passed has invalid types or unsupported field names |
| start_urls_unreachable | None of the start URLs returned 2xx; check DNS / target availability |
See Errors for the full envelope format.
What’s next
- Read what the scraper produced → Data
- One-shot preview without saving a scraper → Preview
- Use these tools from Claude → MCP quickstart