Scrapers

The full scraper CRUD + run + watch lifecycle. A scraper is a reusable extraction recipe; runs of a scraper produce jobs; jobs produce rows of data (see Data).

Create or update — `PUT /api/scraper`

Upsert semantics: omit id in the body to create a new scraper, set id to update an existing one. The same endpoint covers both — there is no separate PUT /api/scraper/{id} (try it and you get 405). The body is the full ScraperDTO; partial updates are not supported — always fetch the current state via GET /api/scraper/{id}, mutate, then PUT the whole thing back.

The easiest way to discover a valid ScraperDTO for a given site is to call the Preview endpoint first — it returns one or more candidate scraperDTO payloads you can copy into your PUT body.


PUT /api/scraper
Authorization: Bearer <key>
Idempotency-Key: <uuid>
Content-Type: application/json
 
{
  "id": null,
  "name": "competitor-prices",
  "groupRef": "grp_xyz",
  "type": "MULTIPLE_PRODUCTS",
  "sourceConfig": {
    "url": "https://competitor.com/products",
    "pagination": { "type": "URL_PAGE_PARAM", "param": "page", "max": 50 }
  },
  "itemsConfig": [
    { "name": "title",    "selector": "h2.product-title",    "kind": "TEXT"   },
    { "name": "price",    "selector": ".price",              "kind": "TEXT"   },
    { "name": "in_stock", "selector": ".add-to-cart",        "kind": "EXISTS" }
  ],
  "schema": { "id": "sch_..." },
  "postProcessRules": []
}

Field	Required	Description
`id`	no	Omit to create; set to update an existing scraper (must be yours — cross-tenant id forge is rejected)
`name`	yes	Human label, unique per customer
`groupRef`	yes	The group’s `id` this scraper belongs to. See Groups
`type`	yes	`SINGLE_PRODUCT` / `MULTIPLE_PRODUCTS` / `APPLICATION_LD_JSON` / `AI_CONF` / `AMAZON` / `GOOGLE_SEARCH`
`sourceConfig`	yes	Source descriptor (url + pagination, or curl). Get the right shape from Preview
`itemsConfig`	conditional	Field selectors (CSS / XPath / JSON-path depending on type). Not used by `AI_CONF` (which uses `schema` instead)
`schema`	for `AI_CONF` only	`{ id }` reference to a customer schema. See Schemas
`postProcessRules`	no	List of `PostProcessRule` transformations applied to each scraped row (`CURRENCY_CONVERT`, `ENUM_MAP`, `REGEX_CLEAN`). Each rule reads one field, writes a derived column. See Preview a post-process rule for the per-kind DTO shape + behaviour.
`schedule`	no	Cron expression for scheduled runs (also set via `PUT /api/scraper/group/{id}/start-type/{startType}`)

Response — the persisted ScraperDTO with the assigned id.

Errors — 400 (validation failed / cross-tenant id forge / schema_validation_failed) / 401 / 402 (plan MAX_SCRAPERS limit on create) / 403 (N/A) / 404 (N/A) / 429 / 500.
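The fetch, mutate, PUT cycle described above can be sketched as follows. This is a minimal illustration, not an official client: the helper names `build_upsert_request` and `rename` are hypothetical, the `current` dict stands in for whatever `GET /api/scraper/{id}` returned, and the base URL is taken from the curl examples on this page. The request is only built, not sent.

```python
import json
import urllib.request

# Base URL as it appears in this page's curl examples.
BASE = "https://portal.scrapewise.ai/api/scraper-api"


def build_upsert_request(api_key: str, idempotency_key: str, dto: dict) -> urllib.request.Request:
    """Build (but do not send) PUT /api/scraper carrying the full ScraperDTO."""
    return urllib.request.Request(
        url=f"{BASE}/api/scraper",
        data=json.dumps(dto).encode("utf-8"),
        method="PUT",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Idempotency-Key": idempotency_key,
            "Content-Type": "application/json",
        },
    )


def rename(current: dict, new_name: str) -> dict:
    """Change one field on a copy; every other field is sent back unchanged."""
    updated = dict(current)
    updated["name"] = new_name
    return updated


# Hypothetical DTO, standing in for the body returned by GET /api/scraper/{id}.
current = {
    "id": "scr_abc123",
    "name": "competitor-prices",
    "groupRef": "grp_xyz",
    "type": "MULTIPLE_PRODUCTS",
    "postProcessRules": [],
}
req = build_upsert_request(
    "KEY", "550e8400-e29b-41d4-a716-446655440000", rename(current, "competitor-prices-v2")
)
# urllib.request.urlopen(req) would send it.
```

Because `id` stays in the body, the same request shape covers the update case; dropping `id` from the dict would turn it into a create.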

List — `GET /api/scraper/list`


curl -H "Authorization: Bearer $KEY" \
  "https://portal.scrapewise.ai/api/scraper-api/api/scraper/list?limit=50"

Paginated (see REST overview — pagination).

Filters:

Param	Effect
`q`	Free-text on `name`
`groupId`	Only scrapers in this group
`hasRunWithin`	e.g. `7d` — only scrapers run in last N days

Errors — 400 (N/A) / 401 / 403 (N/A) / 404 (N/A — empty list returned if none) / 429 / 500.
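A small sketch of composing the list URL from the filters above. The function name `list_url` and its keyword arguments are illustrative; only the query parameter names (`limit`, `q`, `groupId`, `hasRunWithin`) come from this page.

```python
from urllib.parse import urlencode

# Base URL as it appears in this page's curl examples.
BASE = "https://portal.scrapewise.ai/api/scraper-api"


def list_url(limit=50, q=None, group_id=None, has_run_within=None) -> str:
    """Build the GET /api/scraper/list URL, skipping filters that were not set."""
    params = {"limit": limit, "q": q, "groupId": group_id, "hasRunWithin": has_run_within}
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{BASE}/api/scraper/list?{query}"


url = list_url(q="competitor prices", group_id="grp_xyz", has_run_within="7d")
```

`urlencode` takes care of escaping the free-text `q` value, which is easy to get wrong when concatenating strings by hand.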

Get — `GET /api/scraper/{id}`

Single scraper by id. Returns the full ScraperDTO definition (sourceConfig, itemsConfig, postProcessRules, schema reference, etc.).

Errors — 400 (N/A — invalid id format surfaces as 404 envelope) / 401 / 403 (N/A) / 404 (scraper_not_found) / 429 / 500.

Updating an existing scraper: there is no separate PUT /api/scraper/{id} endpoint. Update is via PUT /api/scraper with id set in the body — same endpoint that handles create.

V2: enriched create/update + get — `PUT /api/scraper/v2` + `GET /api/scraper/v2/{id}`

V2 endpoints have identical semantics to their V1 counterparts (PUT for upsert, GET for read by id) but return the richer ScraperSuperDTO shape instead of ScraperDTO: configValid flag, full inlined schema body, full inlined group body, and rolled-up runStats from the most-recent N runs.

Use V2 when you need the full picture in one round-trip instead of composing V1-get + a separate group fetch + a separate schema fetch. V2 is the preferred shape for new tooling — V1 is kept for portal back-compat.


PUT /api/scraper/v2
Authorization: Bearer <key>
Idempotency-Key: <uuid>
Content-Type: application/json
 
{ ... same ScraperDTO body as V1 ... }


curl -H "Authorization: Bearer $KEY" \
  https://portal.scrapewise.ai/api/scraper-api/api/scraper/v2/scr_abc123

Response (200) — ScraperSuperDTO (strict superset of ScraperDTO):


{
  "id": "scr_abc123",
  "name": "competitor-prices",
  "configValid": true,
  "schema": { "id": "sch_...", "content": { /* full schema document */ } },
  "group": { "id": "grp_...", "name": "competitors", "dataTable": "competitor_prices" },
  "runStats": { "lastNRuns": 5, "successes": 5, "failures": 0, "avgRows": 184 },
  "...": "all V1 fields preserved"
}

Errors (both PUT v2 and GET v2) — same as the V1 equivalents: 400 (validation / cross-tenant id forge) / 401 / 402 (MAX_SCRAPERS on create) / 403 (N/A) / 404 (scraper_not_found on GET) / 429 / 500.

Delete a scraper (destructive — two-call protocol) — `DELETE /api/scraper/{id}`

Destructive operation. Deleting a scraper is permanent. By default the rows the scraper produced are preserved (still accessible via GET /api/scraper/data?scraperId=...); pass withData=true to also delete those rows.

ADR-012 two-call pattern: first preview, then commit within 5 minutes.

Steps

POST /api/scraper/{id}/preview-delete[?withData=true] — mints a 5-minute DestructiveOpToken + preview summary listing what will be removed (scraper config + groups + optional data cascade).
DELETE /api/scraper/{id}[?withData=true] — commits the deletion (idempotency-key required).

Skipping the preview step deletes rows without confirmation.

Step 1 — Preview


POST /api/scraper/scr_abc123/preview-delete?withData=true
Authorization: Bearer <key>

Response: DestructivePreviewResponseDTO (token, opName, targetEntityId, previewSummary with cascade counts + warnings). Render to the user before committing.

Step 2 — Commit


DELETE /api/scraper/scr_abc123?withData=true
Authorization: Bearer <key>
Idempotency-Key: <uuid>

withData=false (default) → preserves the scraped rows; withData=true → also removes them from the group’s data collection (irreversible).

Response — 204 No Content.

Errors (both steps)

Code	Meaning
400	Scraper doesn’t exist for this customer
401	Missing/invalid bearer
403	N/A
404	N/A (400 used)
429	Rate-limited
500	Persistence failure mid-delete

Deleting only the data, not the scraper: use DELETE /api/scraper/data?scraperJobStatusId=..., which also follows the two-call protocol.
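The two calls for deleting a scraper can be built as a pair so that both carry the same `withData` choice. This is a sketch under assumptions: `delete_requests` is a hypothetical helper, the requests are built but not sent, and the way the DestructiveOpToken from step 1 is conveyed to step 2 is not shown in this section, so the sketch reproduces only the headers documented above.

```python
import urllib.request
import uuid

# Base URL as it appears in this page's curl examples.
BASE = "https://portal.scrapewise.ai/api/scraper-api"


def delete_requests(api_key: str, scraper_id: str, with_data: bool = False):
    """Return (preview, commit) requests that agree on the withData flag."""
    query = "?withData=true" if with_data else ""
    auth = {"Authorization": f"Bearer {api_key}"}
    preview = urllib.request.Request(
        f"{BASE}/api/scraper/{scraper_id}/preview-delete{query}",
        method="POST",
        headers=auth,
    )
    commit = urllib.request.Request(
        f"{BASE}/api/scraper/{scraper_id}{query}",
        method="DELETE",
        headers={**auth, "Idempotency-Key": str(uuid.uuid4())},
    )
    return preview, commit


preview, commit = delete_requests("KEY", "scr_abc123", with_data=True)
# Send `preview` first, show its previewSummary to the user,
# then send `commit` within the 5-minute window.
```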

Auto-refresh values — keeping a cURL alive across the site’s deploys

Some sites rotate pieces of their URLs or headers on every deploy — Next.js sites change buildId in the _next/data/<id>/… path, Vercel rotates x-deployment-id, anti-bot middleware mints fresh tokens. A cURL captured from DevTools on Tuesday breaks on Wednesday’s deploy, and you’d otherwise have to re-paste a fresh cURL every time.

Two ways the platform solves this. The first is automatic and needs no configuration; the second is opt-in per scraper:

1. Zero-config Next.js auto-recovery (no setup needed)

When an API/CURL scraper returns 404 AND the URL matches /_next/data/<id>/, the engine automatically:

Fetches the host root (e.g. https://www.ermitazas.lt/)
Reads <script id="__NEXT_DATA__"> for the fresh buildId
Reads <meta name="next-deployment-id"> for the fresh deployment id (if present)
Rewrites the cURL with the fresh values and retries once

Works for ermitazas.lt (Vercel), karkkainen.com (Next.js on CloudFront), and any standard Next.js site without any per-scraper config. If discovery fails or returns the same tokens (genuine 404), the original error propagates unchanged.
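To make the recovery steps above concrete, here is a rough sketch of the buildId part only. It is not the engine's implementation: the function names are hypothetical, the HTML is parsed with a simple regex rather than a real parser, the build ids in the usage lines are made up, and the deployment-id header is left out.

```python
import json
import re

NEXT_DATA_RE = re.compile(r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', re.DOTALL)
BUILD_PATH_RE = re.compile(r"/_next/data/([^/]+)/")


def discover_build_id(html: str):
    """Read buildId from the page's __NEXT_DATA__ blob, or None if absent."""
    match = NEXT_DATA_RE.search(html)
    if match is None:
        return None
    return json.loads(match.group(1)).get("buildId")


def refresh_url(url: str, root_html: str):
    """Return the URL with a fresh buildId, or None when no retry is warranted."""
    stale = BUILD_PATH_RE.search(url)
    fresh = discover_build_id(root_html)
    # No /_next/data/<id>/ segment, failed discovery, or an unchanged id
    # (a genuine 404): let the original error propagate.
    if stale is None or fresh is None or fresh == stale.group(1):
        return None
    return url[: stale.start(1)] + fresh + url[stale.end(1):]


html = '<html><script id="__NEXT_DATA__" type="application/json">{"buildId":"new456","props":{}}</script></html>'
old = "https://www.ermitazas.lt/_next/data/old123/lt/search.json?q=home4you"
new = refresh_url(old, html)
```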

2. Explicit `sourceConfig.prelaunch` (for custom rotating tokens)

For sites with non-standard rotating tokens (anti-bot middleware, JWT-shaped nonces, custom markers), declare extraction rules on sourceConfig.prelaunch. The cURL template uses {{NAME}} placeholders that get substituted with fresh values before each fetch.


{
  "sourceConfig": {
    "curl": "curl 'https://www.ermitazas.lt/_next/data/{{BUILD_ID}}/lt/search.json?q=home4you' -H 'x-deployment-id: {{DEPLOYMENT_ID}}'",
    "prelaunch": {
      "discoveryUrl": "https://www.ermitazas.lt/",
      "ttlMinutes": 360,
      "tokens": [
        { "name": "BUILD_ID",      "source": "NEXT_DATA", "expression": "/buildId" },
        { "name": "DEPLOYMENT_ID", "source": "META",      "expression": "next-deployment-id" }
      ]
    }
  }
}

`sourceConfig.prelaunch` field	Type	Description
`discoveryUrl`	string	Public page to fetch once per TTL to extract fresh values. Usually the site’s home page.
`ttlMinutes`	int (1–43200)	How long a successful resolve is cached. Default 360 (6 hours). On 404 from the proxy, the cache is invalidated and refetched.
`tokens`	`TokenRule[]`	One rule per placeholder. Empty list = feature disabled.

`TokenRule` field	Type	Description
`name`	string, `^[A-Za-z0-9_]+$`	Placeholder name. Referenced as `{{NAME}}` in the cURL — case-sensitive.
`source`	enum	`NEXT_DATA` (RFC 6901 JSON pointer into the page’s `<script id="__NEXT_DATA__">` blob), `META` (value of a `<meta>` tag by its `name` attribute), or `REGEX` (first capture group of a regex run against the raw HTML).
`expression`	string	Per-source extractor: `/buildId` for NEXT_DATA, `next-deployment-id` for META, `"buildId":"([^"]+)"` for REGEX.
`validatePattern`	string (optional)	Regex the resolved value must match. Default `^[A-Za-z0-9._=+/~-]{1,4096}$` — accepts `dpl_…`, base64-padded, JWT-shape; rejects URL-reserved + injection vectors (`&`, `?`, `#`, space, CRLF, quotes).
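The rule tables above can be read as a small resolve-then-substitute pipeline. The sketch below illustrates that reading and nothing more: function names are hypothetical, HTML is matched with simplified regexes (the META branch assumes `name` precedes `content`), and only the default `validatePattern` is taken verbatim from the table.

```python
import json
import re

DEFAULT_VALIDATE = r"^[A-Za-z0-9._=+/~-]{1,4096}$"  # default validatePattern from the table
PLACEHOLDER_RE = re.compile(r"\{\{([A-Za-z0-9_]+)\}\}")
NEXT_DATA_RE = re.compile(r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', re.DOTALL)


def json_pointer(doc, pointer: str):
    """Minimal RFC 6901 lookup (handles ~1 and ~0 escapes and list indices)."""
    for part in pointer.split("/")[1:]:
        part = part.replace("~1", "/").replace("~0", "~")
        doc = doc[int(part)] if isinstance(doc, list) else doc[part]
    return doc


def resolve(rule: dict, html: str) -> str:
    """Extract one token value according to rule['source'], then validate it."""
    source, expr, value = rule["source"], rule["expression"], None
    if source == "NEXT_DATA":
        blob = NEXT_DATA_RE.search(html)
        try:
            value = json_pointer(json.loads(blob.group(1)), expr) if blob else None
        except (KeyError, IndexError, ValueError):
            value = None
    elif source == "META":
        tag = re.search(r'<meta name="%s" content="([^"]*)"' % re.escape(expr), html)
        value = tag.group(1) if tag else None
    elif source == "REGEX":
        hit = re.search(expr, html)
        value = hit.group(1) if hit else None
    pattern = rule.get("validatePattern", DEFAULT_VALIDATE)
    if not isinstance(value, str) or re.fullmatch(pattern, value) is None:
        raise ValueError(f"prelaunch_refresh_failed: {rule['name']}")
    return value


def substitute(curl_template: str, rules: list, html: str) -> str:
    """Replace every {{NAME}}; an undeclared placeholder raises KeyError."""
    values = {rule["name"]: resolve(rule, html) for rule in rules}
    return PLACEHOLDER_RE.sub(lambda m: values[m.group(1)], curl_template)


page = ('<html><head><meta name="next-deployment-id" content="dpl_abc"></head>'
        '<script id="__NEXT_DATA__" type="application/json">{"buildId":"b1d2"}</script></html>')
rules = [
    {"name": "BUILD_ID", "source": "NEXT_DATA", "expression": "/buildId"},
    {"name": "DEPLOYMENT_ID", "source": "META", "expression": "next-deployment-id"},
]
template = ("curl 'https://www.ermitazas.lt/_next/data/{{BUILD_ID}}/lt/search.json?q=home4you' "
            "-H 'x-deployment-id: {{DEPLOYMENT_ID}}'")
resolved = substitute(template, rules, page)
```

Raising on any unresolved or invalid value mirrors the documented guarantee that a literal placeholder is never sent upstream.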

Portal users: the same configuration is editable in the create/edit scraper form under “Auto-refresh values” when the scraper is in API/CURL mode. Leave the rules list empty to disable.

Behaviour notes:

Self-healing on rotation: if the substituted cURL returns 404/410/451, the engine invalidates the cached tokens, rediscovers, and retries once. If the fresh tokens are byte-identical to the stale ones (i.e. a genuine product 404), the retry is short-circuited.
Cache + concurrency: resolved tokens are cached per (scraperId, configHash) per pod. A 500-URL batch hitting the same scraper triggers exactly one discovery fetch — concurrent callers coalesce via a per-scraper mutex.
Cross-scraper throttle: discovery requests share a 3-permit semaphore so a cold-start storm (e.g. 50 scrapers waking up after a pod restart) doesn’t saturate the upstream proxy.
Failure modes: if discovery fails (network, missing markers, regex no-match, value fails validatePattern), the scrape fails with a typed prelaunch_refresh_failed exception — the system never silently sends a literal {{BUILD_ID}} to the upstream.
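The cache, coalescing, and throttle behaviour in the notes above can be modelled in a few lines. This is an illustrative model only, not the platform's code: `TokenCache`, `fake_discover`, and the 50 ms sleep are invented for the demonstration, while the (scraperId, configHash) key, the per-scraper mutex, and the 3-permit semaphore follow the description.

```python
import threading
import time

DISCOVERY_PERMITS = threading.Semaphore(3)  # shared across all scrapers


class TokenCache:
    def __init__(self, discover, ttl_seconds: float):
        self._discover = discover   # callable: scraper_id -> dict of tokens
        self._ttl = ttl_seconds
        self._entries = {}          # (scraper_id, config_hash) -> (expires_at, tokens)
        self._locks = {}            # scraper_id -> mutex
        self._guard = threading.Lock()

    def _lock_for(self, scraper_id):
        with self._guard:
            return self._locks.setdefault(scraper_id, threading.Lock())

    def get(self, scraper_id, config_hash):
        key = (scraper_id, config_hash)
        # Concurrent callers for the same scraper queue here; the first one
        # discovers, the rest find the fresh cache entry.
        with self._lock_for(scraper_id):
            entry = self._entries.get(key)
            if entry and entry[0] > time.monotonic():
                return entry[1]
            with DISCOVERY_PERMITS:
                tokens = self._discover(scraper_id)
            self._entries[key] = (time.monotonic() + self._ttl, tokens)
            return tokens

    def invalidate(self, scraper_id, config_hash):
        with self._lock_for(scraper_id):
            self._entries.pop((scraper_id, config_hash), None)


calls = []


def fake_discover(scraper_id):
    calls.append(scraper_id)
    time.sleep(0.05)  # simulate the discovery fetch
    return {"BUILD_ID": "b1d2"}


cache = TokenCache(fake_discover, ttl_seconds=360 * 60)
threads = [threading.Thread(target=cache.get, args=("scr_abc123", "hash1")) for _ in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Twenty concurrent callers end up sharing a single discovery call; calling `invalidate` (as the engine does on a rotation error) forces the next `get` to rediscover.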

Trigger a run — `GET /api/scraper/{id}/run`

Yes, GET for run is intentional — runs are idempotent at the Idempotency-Key level (see below). Triggers an async job.


curl -H "Authorization: Bearer $KEY" \
  "https://portal.scrapewise.ai/api/scraper-api/api/scraper/$SCRAPER_ID/run"

Response:


{
  "jobId": "job_xyz789",
  "scraperId": "scr_abc123",
  "status": "PENDING",
  "queuedAt": "2026-05-19T12:30:00Z"
}

The job runs async. To watch progress, use the SSE stream (next section).

Idempotency for runs

Pass an Idempotency-Key header to make retries safe:


curl -H "Authorization: Bearer $KEY" \
  -H "Idempotency-Key: 550e8400-e29b-41d4-a716-446655440000" \
  ".../api/scraper/$SCRAPER_ID/run"

If you retry with the same idempotency key within 24h, Scrapewise returns the original job (no double-run).

Errors — 400 (scraper_already_running / start_urls_unreachable) / 401 / 403 (N/A) / 404 (scraper_not_found) / 429 / 500.
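The point of the idempotency key is that a retry reuses it. A minimal sketch, assuming a caller-supplied `send` function (hypothetical) that performs the HTTP call and raises `ConnectionError` on a transient failure; `flaky_send` below simulates two timeouts followed by success.

```python
import uuid


def run_with_retries(send, scraper_id: str, attempts: int = 3) -> dict:
    """Trigger a run, retrying transient failures with the SAME Idempotency-Key."""
    key = str(uuid.uuid4())  # generated once, reused for every attempt
    last_error = None
    for _ in range(attempts):
        try:
            return send(scraper_id, key)
        except ConnectionError as exc:
            last_error = exc
    raise last_error


seen_keys = []


def flaky_send(scraper_id, idempotency_key):
    """Stand-in transport: fails twice, then returns a job like the example above."""
    seen_keys.append(idempotency_key)
    if len(seen_keys) < 3:
        raise ConnectionError("simulated timeout")
    return {"jobId": "job_xyz789", "scraperId": scraper_id, "status": "PENDING"}


job = run_with_retries(flaky_send, "scr_abc123")
```

Generating a new key inside the loop would defeat the purpose: each attempt would look like a new run. A real client would also add a backoff between attempts, which is omitted here.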

Stream job status — `GET /api/scraper-job-status/{id}/stream`

Server-Sent Events stream of an individual job’s progress. The {id} is a scraper job status id (ScraperJobStatusDTO.id), not the scraper id — obtain it from GET /api/scraper/load-history filtered by scraperId.


# 1) Find the latest job for this scraper
JOB_ID=$(curl -s -H "Authorization: Bearer $KEY" \
  "https://portal.scrapewise.ai/api/scraper-api/api/scraper/load-history?scraperId=$SCRAPER_ID&size=1" \
  | jq -r '.content[0].id')
 
# 2) Stream its progress
curl -N -H "Authorization: Bearer $KEY" \
  "https://portal.scrapewise.ai/api/scraper-api/api/scraper-job-status/$JOB_ID/stream"

The -N flag disables curl’s buffering so events appear as they arrive.

You’ll see events like:


event: status
data: {"state":"RUNNING","itemsQuantity":0}

event: status
data: {"state":"RUNNING","itemsQuantity":42}

event: status
data: {"state":"RUNNING","itemsQuantity":140}

event: status
data: {"state":"SUCCESS","itemsQuantity":175,"duration":47213}

Field names follow ScraperJobStatusDTO — state, itemsQuantity, totalRequests, duration, errorMessage. The stream closes when the job reaches a terminal state (SUCCESS / FAILED / CANCELLED).
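For clients that want the event name as well as the payload, the stream can be parsed line by line. The sketch below works on any iterable of decoded text lines, so it can be tried without a network connection; `iter_status_events` is a hypothetical name, and the sample is the event sequence shown above, shortened.

```python
import json

TERMINAL = {"SUCCESS", "FAILED", "CANCELLED"}


def iter_status_events(lines):
    """Yield (event_name, payload) pairs and stop after a terminal state."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("event:"):
            event = line[6:].strip()
        elif line.startswith("data:"):
            data.append(line[5:].lstrip(" "))
        elif line == "" and data:
            # A blank line terminates one SSE event.
            payload = json.loads("\n".join(data))
            yield event, payload
            if payload.get("state") in TERMINAL:
                return
            event, data = "message", []


sample = """event: status
data: {"state":"RUNNING","itemsQuantity":0}

event: status
data: {"state":"RUNNING","itemsQuantity":42}

event: status
data: {"state":"SUCCESS","itemsQuantity":175,"duration":47213}

event: status
data: {"state":"RUNNING","itemsQuantity":999}

"""
events = list(iter_status_events(sample.splitlines()))
```

The fourth event in the sample is deliberately bogus: the parser stops at `SUCCESS` and never yields it.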

Node.js SSE example


import { fetchEventSource } from '@microsoft/fetch-event-source';
 
// jobStatusId obtained from /api/scraper/load-history beforehand
await fetchEventSource(
  `https://portal.scrapewise.ai/api/scraper-api/api/scraper-job-status/${jobStatusId}/stream`,
  {
    headers: { Authorization: `Bearer ${process.env.SCRAPEWISE_KEY}` },
    onmessage(ev) {
      const data = JSON.parse(ev.data);
      console.log(`${data.state} items=${data.itemsQuantity ?? '-'}`);
    },
  }
);

Python SSE example


import json, os, requests
 
# job_status_id obtained from /api/scraper/load-history beforehand
with requests.get(
    f'https://portal.scrapewise.ai/api/scraper-api/api/scraper-job-status/{job_status_id}/stream',
    headers={'Authorization': f"Bearer {os.environ['SCRAPEWISE_KEY']}"},
    stream=True,
) as r:
    r.raise_for_status()  # fail fast on 401/5xx instead of parsing an error body
    for line in r.iter_lines():
        if line and line.startswith(b'data: '):
            event = json.loads(line[6:])
            print(event['state'], event.get('itemsQuantity'))

Errors — 400 (N/A) / 401 / 403 (N/A) / 404 (N/A — invalid {id} surfaces as a 401 envelope) / 429 (N/A) / 500. The SSE connection closes with the terminal event when the job ends; long-lived 5xx errors break the connection (reconnect from your client).

Stop a running scraper — `GET /api/scraper/{id}/stop`

Stops the scraper if it’s currently in RUNNING state. Scrapers in any other state are silently skipped (the inner if-check short-circuits the save). Partial scraped data already written to the group’s data collection is kept — clear it with DELETE /api/scraper/data?scraperJobStatusId=... if needed.

GET (not POST/DELETE) is intentional — the operation is fully idempotent at the underlying state machine level and the Idempotency-Key header is required.


curl -H "Authorization: Bearer $KEY" \
  -H "Idempotency-Key: $(uuidgen)" \
  "https://portal.scrapewise.ai/api/scraper-api/api/scraper/$SCRAPER_ID/stop"

Response — 200 OK (empty body). State transitions complete synchronously before the response is returned.

Errors — 400 (scraper_not_found) / 401 / 403 (N/A) / 404 (N/A — 400 envelope used) / 429 / 500.

To stop ALL running scrapers in a group at once, see Stop every running scraper in a group.

Sample data — `GET /api/scraper/{id}/get-sample-data`

Runs the scraper synchronously against its source URLs (capped to a FIRST batch / FIRST page — NOT a full run, no pagination expansion) and returns the parsed rows immediately. Use to validate a scraper config visually before scheduling a full async run.


curl -H "Authorization: Bearer $KEY" \
  "https://portal.scrapewise.ai/api/scraper-api/api/scraper/$SCRAPER_ID/get-sample-data?useCache=true"

Param	Default	Description
`useCache`	`false`	Return the cached sample within the 1h TTL — faster, possibly stale.

Response (200) — List<Map<String, Any>> keyed by the scraper’s configured field names. Values are post-processed if rules are set.

Errors — 400 (scraper_not_found) / 401 / 403 (N/A) / 404 (N/A — 400 used instead) / 429 / 500. Also 502 if the source site is unreachable.

Preview a post-process rule — `POST /api/scraper/preview-rule`

Apply a single PostProcessRule to a user-provided sample value and return the derived output without persisting anything or running a real scrape. Use to validate rule parameters before saving the rule onto a scraper config.

Request body shape

Field	Type	Description
`rule.kind`	enum	`CURRENCY_CONVERT` / `ENUM_MAP` / `REGEX_CLEAN`
`rule.sourceField`	string	Name of the scraped field to read from
`rule.outputField`	string	Name of the derived column to write to (must differ from `sourceField` and not collide with reserved system fields like `date`, `group`, `scraperId`)
`rule.outputType`	enum	`NUMBER` / `TEXT` / `BOOLEAN` — declarative type hint for downstream consumers. The executor does NOT coerce values to this type — whatever the rule’s params produce is what’s written.
`rule.params`	object	Per-kind shape — see below
`sampleValue`	any	The value to test the rule against

Response (200): { "output": <derived value> } (plus rate + rateDate for CURRENCY_CONVERT).

Per-kind reference

Each rule kind has its own params shape and its own null-source behaviour. The three kinds differ deliberately — see the per-kind note below.

`CURRENCY_CONVERT` — convert one currency to another using ECB rates

Param	Required	Description
`from`	yes	3-letter ISO code (`SEK`, `EUR`, `USD`, …)
`to`	yes	3-letter ISO code

Null source: no-op — the output field is not written.

Side outputs: writes {outputField}_rate and {outputField}_rateDate alongside the converted value.


POST /api/scraper/preview-rule
Authorization: Bearer <key>
Content-Type: application/json
 
{
  "rule": {
    "kind": "CURRENCY_CONVERT",
    "sourceField": "priceSek",
    "outputField": "priceEur",
    "outputType": "NUMBER",
    "params": { "from": "SEK", "to": "EUR" }
  },
  "sampleValue": 2490
}

Response: { "output": 215.85, "rate": 0.0867, "rateDate": "2026-04-17" }.

`ENUM_MAP` — look up the source value in a mapping table

Param	Required	Description
`mapping`	yes (non-empty)	`Map<string, any>` — key is the source value (stringified), value is what gets written to the output column
`default`	no	What to write when no key matches (or when the matched value is `null`)

Null source — the null-sentinel: a null source value looks up the empty-string "" key. The same applies to an empty-string source ("") — null and "" are intentionally treated as the same “no meaningful value” for ENUM_MAP lookup purposes. This is the canonical pattern for deriving a True/False column based on whether a scraped field is present or absent on the page.

Null mapped value: if a key maps to null (e.g. {"": null}), that’s treated as “no real mapping” and falls through to default. Closes the silent-strip footgun where a null output value would be removed from the row by the downstream null-filter.


POST /api/scraper/preview-rule
Authorization: Bearer <key>
Content-Type: application/json
 
{
  "rule": {
    "kind": "ENUM_MAP",
    "sourceField": "promotionalBanner",
    "outputField": "isPromoted",
    "outputType": "TEXT",
    "params": {
      "mapping": {
        "":     "False",
        "Sale": "True",
        "New":  "True"
      },
      "default": "Unknown"
    }
  },
  "sampleValue": null
}

Response: { "output": "False" } — the empty-string mapping key matched the null source.

Worked behaviour table for the rule above:

Scraped `promotionalBanner`	Matches	Written to `isPromoted`
`null` / field absent from page	`""` row	`"False"`
`""` (empty string scraped)	`""` row	`"False"`
`"Sale"`	`Sale` row	`"True"`
`"New"`	`New` row	`"True"`
`"Mystery"`	(no row matches)	`"Unknown"` (default)

The same null-sentinel works in the portal’s rule builder — leave the “When scraped value is” cell blank and type the desired output in “Save as”. The empty-key row shows a (matches missing / null) placeholder so it’s visually distinguishable from a half-finished add.
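The lookup rules for ENUM_MAP fit in a few lines. This is a sketch of the documented behaviour, not the executor's source; `enum_map` is a hypothetical name and Python's `str()` stands in for whatever stringification the platform applies to non-string source values.

```python
def enum_map(value, mapping: dict, default=None):
    """Look up a scraped value; null and "" both use the "" key."""
    key = "" if value is None else str(value)
    mapped = mapping.get(key)
    # A key that maps to null counts as "no real mapping" and falls through.
    return default if mapped is None else mapped


mapping = {"": "False", "Sale": "True", "New": "True"}
results = [enum_map(v, mapping, "Unknown") for v in (None, "", "Sale", "New", "Mystery")]
```

The `results` list reproduces the "Written to `isPromoted`" column of the worked table above, row by row.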

`REGEX_CLEAN` — keep only characters matching a safe character class

Param	Required	Description
`keep`	no	Character class to keep, e.g. `[0-9]`, `[A-Za-z0-9]`, or a bare literal like `0-9` (auto-wrapped in `[…]`). Default `[A-Za-z0-9]`.

Null source: no-op — the output field is not written.

Safety: the keep pattern is restricted to simple character classes — groups (), quantifiers *+?, alternation |, and curly braces {} are all rejected to prevent catastrophic backtracking. Unsafe pattern → 200 { "error": "Unsafe REGEX_CLEAN 'keep' pattern" }.


POST /api/scraper/preview-rule
Authorization: Bearer <key>
Content-Type: application/json
 
{
  "rule": {
    "kind": "REGEX_CLEAN",
    "sourceField": "price",
    "outputField": "priceClean",
    "outputType": "TEXT",
    "params": { "keep": "[0-9.]" }
  },
  "sampleValue": "$1,234.56"
}

Response: { "output": "1234.56" }.
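The following sketch mimics the documented REGEX_CLEAN behaviour locally, which can help when choosing a `keep` class before calling the endpoint. It is an approximation under stated assumptions: `regex_clean` is a hypothetical name, the unsafe-character check is a literal reading of the safety note above, and the real validator may differ in detail.

```python
import re

UNSAFE_CHARS = set("()*+?|{}")  # groups, quantifiers, alternation, braces


def regex_clean(value, keep: str = "[A-Za-z0-9]") -> dict:
    """Keep only characters matching the `keep` class; mirror the 200-with-error shape."""
    if value is None:
        return {}  # null source: no-op, nothing is written
    if any(ch in UNSAFE_CHARS for ch in keep):
        return {"error": "Unsafe REGEX_CLEAN 'keep' pattern"}
    if not (keep.startswith("[") and keep.endswith("]")):
        keep = f"[{keep}]"  # bare literal such as 0-9 is auto-wrapped
    return {"output": "".join(re.findall(keep, str(value)))}


cleaned = regex_clean("$1,234.56", "[0-9.]")
```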

Cross-rule null-source posture

Rule kind	Null source behaviour
`CURRENCY_CONVERT`	No-op — output field not written
`ENUM_MAP`	Looks up the `""` key (null-sentinel) — produces `default` if `""` is not in the mapping
`REGEX_CLEAN`	No-op — output field not written

ENUM_MAP is the only LOOKUP rule of the three (the others are transforms), which is why it’s the only one with null-sentinel semantics. If you need True/False from null/non-null, ENUM_MAP is the rule.

Errors

On invalid rule params, the endpoint returns 200 with an error string instead of throwing — intentional non-throw shape so agents can iterate on rule params without exception handling.

400 (request body validation) / 401 / 403 (N/A) / 404 (N/A) / 429 / 500. Rule-execution failures surface as 200 { "error": "..." } in the body, NOT as an HTTP error.

SEO fields

Get SEO fields for a scraper — `GET /api/scraper/{id}/seo-fields`

Returns SEO metadata (title, description, og:* tags) extracted from the scraper’s configured source URL. Cached at the SEO field service layer.


curl -H "Authorization: Bearer $KEY" \
  https://portal.scrapewise.ai/api/scraper-api/api/scraper/$SCRAPER_ID/seo-fields

Response (200) — List<SeoField> (name + content per discovered meta tag). Empty list if no extractable meta tags.

Errors — 400 (scraper_not_found) / 401 / 403 (N/A) / 404 (N/A — 400 used) / 429 / 500.

Get SEO fields for an arbitrary URL — `POST /api/scraper/seo-fields`

Same extraction logic but takes a URL in the body — no pre-existing scraper required. Subject to the SSRF deny-list.


POST /api/scraper/seo-fields
Authorization: Bearer <key>
Content-Type: application/json
 
{ "url": "https://example.com" }

Response (200) — List<SeoField>.

Errors — 400 (SSRF_BLOCKED for internal / link-local / cloud-metadata hosts) / 401 / 403 (N/A) / 404 (N/A) / 429 / 500. Also 502 if upstream unreachable.

Group-level run + stop

Run every scraper in a group — `GET /api/scraper/group/{id}/run`

Triggers a manual run of EVERY scraper that belongs to the group. Each run is queued asynchronously. Requires the RUN_WITH_GROUP plan feature.


curl -H "Authorization: Bearer $KEY" \
  -H "Idempotency-Key: $(uuidgen)" \
  "https://portal.scrapewise.ai/api/scraper-api/api/scraper/group/grp_xyz/run?mergeData=true"

Param	Default	Description
`mergeData`	`false`	After all per-scraper runs complete, merge results into the group’s enriched-sibling collection.

Response — 204 No Content. Poll GET /api/scraper/load-history per scraper to check status.

Errors — 400 (group not found) / 401 / 402 (plan lacks RUN_WITH_GROUP) / 403 (N/A) / 404 (N/A — 400 used) / 429 / 500.

Stop every running scraper in a group — `GET /api/scraper/group/{id}/stop`

Stops every scraper in the group that’s currently RUNNING. Scrapers not in RUNNING state are silently skipped. Partial scraped data already written is kept — clear it via DELETE /api/scraper/data per scraperJobStatusId if needed.


curl -H "Authorization: Bearer $KEY" \
  -H "Idempotency-Key: $(uuidgen)" \
  "https://portal.scrapewise.ai/api/scraper-api/api/scraper/group/grp_xyz/stop"

Response — 200 OK (empty body). State transitions complete synchronously.

Errors — 400 (group not found) / 401 / 403 (N/A) / 404 (N/A — 400 used) / 429 / 500.

Fallback scraper — `PUT /api/scraper/fallback`

Attach a fallback scraper to a SINGLE_PRODUCT primary scraper. The fallback runs automatically when the primary extracts no data for a link (e.g. a JSON-LD parser as a backup to CSS-selector extraction). Requires the FALLBACK_SCRAPER plan feature.


PUT /api/scraper/fallback
Authorization: Bearer <key>
Idempotency-Key: <uuid>
Content-Type: application/json
 
{
  "scraperId": "scr_abc123",
  "fallback": { /* full ScraperDTO of the fallback scraper */ }
}

Response (200) — the persisted fallback ScraperDTO with its assigned id, or null on silent failure (rare; usually surfaces as a 4xx instead).

Errors — 400 (primary scraper isn’t SINGLE_PRODUCT type / scraperId doesn’t exist) / 401 / 402 (plan lacks FALLBACK_SCRAPER) / 403 (N/A) / 404 (N/A — 400 used) / 429 / 500.

Debug: load raw site HTML — `GET /api/scraper/load-site` + `/load-site-content`

Two thin debug fetchers that go through the proxy infrastructure + SSRF deny-list + customer rate limit. Use to inspect a page’s structure when designing a scraper config or debugging selectors.


# Raw text/html
curl -H "Authorization: Bearer $KEY" \
  "https://portal.scrapewise.ai/api/scraper-api/api/scraper/load-site?url=https://example.com&useCache=true"
 
# JSON envelope { "content": "<html>" } — for MCP-style tooling
curl -H "Authorization: Bearer $KEY" \
  "https://portal.scrapewise.ai/api/scraper-api/api/scraper/load-site-content?url=https://example.com"

Param	Description
`url`	Required. Public HTTP(S). Internal / link-local / cloud-metadata hosts are SSRF-blocked.
`cookiesAcceptSelector`	Optional CSS selector for a cookie-banner accept button (rendered fetch only).
`useCache`	Optional; return cached HTML if available (1h TTL).

Errors (both) — 400 (SSRF_BLOCKED) / 401 / 403 (N/A) / 404 (N/A) / 429 (rate limit) / 500. Also 502 if upstream unreachable.
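A client can avoid an obvious `SSRF_BLOCKED` round-trip with a local pre-check. The sketch below is an assumption about what "internal / link-local / cloud-metadata" covers, not the server's actual deny-list: `is_ssrf_blocked` and the small hostname set are illustrative, and hostnames are not resolved via DNS.

```python
import ipaddress

# Illustrative hostnames only; the server-side list is not documented here.
BLOCKED_HOSTNAMES = {"localhost", "metadata.google.internal"}


def is_ssrf_blocked(host: str) -> bool:
    """Best-effort local check for hosts the API is likely to reject."""
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal: compare against known internal hostnames.
        return host.lower() in BLOCKED_HOSTNAMES
    return ip.is_private or ip.is_loopback or ip.is_link_local


checks = {h: is_ssrf_blocked(h) for h in ("169.254.169.254", "10.0.0.5", "127.0.0.1", "8.8.8.8", "example.com")}
```

The server remains the authority: a host that passes this check can still be rejected, for example when a public name resolves to a private address.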

Config parameter catalogue — `GET /api/scraper/config/parameters`

Returns the canonical enum lists used by scraper config builders: scraper types, pagination types, parameter types, post-process kinds, and the common-parameters catalogue. Used by the scraper-builder UI; usually not interesting to API consumers unless you’re building your own builder.


curl -H "Authorization: Bearer $KEY" \
  https://portal.scrapewise.ai/api/scraper-api/api/scraper/config/parameters

Response (200) — ScraperConfigParametersDTO (scraperType, paginationType, parameterType, commonParameters, postProcessKind).

Errors — 400 (N/A) / 401 / 403 (N/A) / 404 (N/A) / 429 / 500.

Common errors

Code	Meaning
`scraper_not_found`	No scraper with that id exists for your customer
`scraper_already_running`	Trying to trigger a run while a previous job is still in flight (stop it first via `GET /api/scraper/{id}/stop`, or wait for the SSE stream to report a terminal state such as `SUCCESS`)
`schema_validation_failed`	The schema you passed has invalid types or unsupported field names
`start_urls_unreachable`	None of the start URLs returned 2xx — check DNS / target availability

See Errors for the full envelope format.

What’s next

Read what the scraper produced → Data
One-shot preview without saving a scraper → Preview
Use these tools from Claude → MCP quickstart
