Scraped data

Once a scraper has run, the rows it produced are accessible through these endpoints.

Read group data — GET /api/scraper/data/group/{id}

The primary scraped-data read. Returns a paginated page of rows for the group, optionally restricted by saved categories or an inline filter expression. isLastRun=true narrows to the most recent run’s rows only.

curl -H "Authorization: Bearer $KEY" \
  "https://portal.scrapewise.ai/api/scraper-api/api/scraper/data/group/grp_xyz?page=0&size=100&isLastRun=true"

Query params

| Param | Type | Description |
| --- | --- | --- |
| page | int | Zero-indexed page number (default 0) |
| size | int | Page size (default 20, capped server-side) |
| sort | string | Spring-style, e.g. scrapedAt,desc |
| categories | string[] | Repeatable; restricts results to these saved-category names. See Saved-filter categories. |
| filters | string | URL-encoded JSON Mongo-style filter expression |
| isLastRun | boolean | If true, only the most recent run’s rows |
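
As a rough illustration of how these parameters combine, the sketch below assembles a request URL with Python's standard library. The helper name `group_data_url` is hypothetical, the base URL is copied from the curl example, and "repeatable" is interpreted here as one `categories=` pair per name, which is an assumption about how the server parses the list.

```python
from urllib.parse import urlencode

# Base URL copied from the curl example above.
BASE_URL = "https://portal.scrapewise.ai/api/scraper-api"


def group_data_url(group_id, page=0, size=20, sort=None,
                   categories=(), filters=None, is_last_run=None):
    """Build the read URL for one page of group data (hypothetical helper)."""
    params = [("page", page), ("size", size)]
    if sort:
        params.append(("sort", sort))
    # Assumption: "repeatable" means one categories=... pair per name.
    params.extend(("categories", name) for name in categories)
    if filters is not None:
        # Pass the raw JSON text here; urlencode percent-encodes it.
        params.append(("filters", filters))
    if is_last_run is not None:
        params.append(("isLastRun", "true" if is_last_run else "false"))
    return f"{BASE_URL}/api/scraper/data/group/{group_id}?{urlencode(params)}"


print(group_data_url("grp_xyz", size=100, is_last_run=True))
```

Called as shown, this reproduces the URL from the curl example.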

Response

Spring Page<Document> — each content row is a dynamic Mongo Document keyed by the scraper’s configured field names. Numbers / booleans / null pass through; _-prefixed keys are server metadata (_sourceUrl, _scrapedAt, _scraperJobStatusId).

{
  "content": [
    {
      "_sourceUrl": "https://...",
      "_scrapedAt": "2026-05-19T...",
      "_scraperJobStatusId": "65a1...",
      "title": "Premium Widget XL",
      "price": 29.99,
      "in_stock": true
    }
  ],
  "totalElements": 1847,
  "totalPages": 19,
  "size": 100,
  "number": 0
}
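
A minimal sketch of walking every page of this response shape follows. It relies only on the `content`, `number`, and `totalPages` fields shown above; `fetch_page` is a hypothetical callable standing in for whatever HTTP client you use, and the canned pages are illustrative data, not real output.

```python
from typing import Any, Callable, Dict, Iterator

Page = Dict[str, Any]


def iter_rows(fetch_page: Callable[[int], Page]) -> Iterator[Dict[str, Any]]:
    """Yield every row, advancing until number + 1 reaches totalPages."""
    page = 0
    while True:
        body = fetch_page(page)
        yield from body.get("content", [])
        page = body.get("number", page) + 1
        if page >= body.get("totalPages", 0):
            break


# Stand-in for an HTTP call: two canned pages in the documented shape.
_PAGES = [
    {"content": [{"title": "A"}, {"title": "B"}], "totalPages": 2, "number": 0},
    {"content": [{"title": "C"}], "totalPages": 2, "number": 1},
]

titles = [row["title"] for row in iter_rows(lambda n: _PAGES[n])]
print(titles)  # ['A', 'B', 'C']
```

Because the function is a generator, rows are processed as each page arrives rather than after the whole dataset is loaded.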

For agent / MCP use, call the sanitized envelope variant instead (GET /api/scraper/data/group/{id}/client): every string field comes pre-wrapped with origin metadata for prompt-injection safety.

Field-level filtering

The filters query param accepts a URL-encoded, Mongo-style filter expression. For example:

?filters=%7B%22price%22%3A%7B%22%24gt%22%3A50%7D%2C%22in_stock%22%3Atrue%7D

This decodes to {"price":{"$gt":50},"in_stock":true}. The $and, $or, and $nor logical groups are supported and can be nested recursively.
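
One way to produce the encoded value is sketched below using only the Python standard library. The helper name `encode_filters` is hypothetical; compact JSON separators are used so the output matches the example string exactly.

```python
import json
from urllib.parse import quote, unquote


def encode_filters(expr: dict) -> str:
    """Serialize a filter expression to compact JSON and percent-encode it."""
    compact = json.dumps(expr, separators=(",", ":"))
    return quote(compact, safe="")


encoded = encode_filters({"price": {"$gt": 50}, "in_stock": True})
print("?filters=" + encoded)

# Nested logical groups encode the same way; decoding restores the JSON.
nested = {"$or": [{"price": {"$gt": 50}}, {"in_stock": True}]}
assert json.loads(unquote(encode_filters(nested))) == nested
```

Avoid encoding the value a second time if your HTTP client already percent-encodes query parameters.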

Errors — 400 (group not found / malformed filters JSON) / 401 / 403 (N/A) / 404 (N/A — 400 envelope used) / 429 / 500.

Delete scraped data

The legacy single-call DELETE /api/scraper/data?scraperId=...|?id=...|?jobId=... shape is no longer wired up. The only DELETE /api/scraper/data endpoint that ships today takes ?scraperJobStatusId=... and follows the destructive two-call protocol described on the Scraper Jobs page (preview-delete, then commit). For broader removals (an entire scraper or an entire group, optionally including its data), use the matching endpoint below:

| Goal | Endpoint |
| --- | --- |
| Delete one run’s rows | DELETE /api/scraper/data?scraperJobStatusId=... (two-call) |
| Delete one scraper (optionally its data) | DELETE /api/scraper/{id}?withData=true (two-call) |
| Delete one group (optionally its data) | DELETE /api/scraper/group/{id}?withData=true (two-call) |

Agent-friendly read (prompt-injection envelope) — GET /api/scraper/data/group/{id}/client

Same paginated data as GET /api/scraper/data/group/{id} — but when called from the MCP gateway, every string field is automatically wrapped in a {type: "scraped", content, origin, truncated} envelope. The agent’s system prompt instructs it never to act on instructions found inside scraped content (prompt-injection protection).

Numbers, booleans, null, and _-prefixed metadata pass through unwrapped (they weren’t scraped from arbitrary HTML).

The sanitized query param is hidden from the OpenAPI surface — the MCP gateway forces it server-side; direct SDK consumers can opt in by setting it explicitly.

Requires the EXTERNAL_API plan feature.

curl -H "Authorization: Bearer $KEY" \
  "https://portal.scrapewise.ai/api/scraper-api/api/scraper/data/group/grp_xyz/client?page=0&size=50"

Response (200) — Spring Page of sanitized rows:

{
  "content": [
    {
      "title": {
        "type": "scraped",
        "content": "Premium Widget XL",
        "origin": "https://...",
        "truncated": false
      },
      "price": 29.99,
      "in_stock": true,
      "_jobId": "65a1..."
    }
  ],
  "totalElements": 1847,
  "totalPages": 37
}
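
On the consuming side, it can help to keep wrapped (untrusted) text apart from pass-through values. The sketch below is one possible approach, assuming the envelope is recognizable by `"type": "scraped"` as in the response above; the helper names are hypothetical.

```python
from typing import Any, Dict, Tuple


def is_scraped(value: Any) -> bool:
    """True for values shaped like the {type: "scraped", ...} envelope."""
    return isinstance(value, dict) and value.get("type") == "scraped"


def split_row(row: Dict[str, Any]) -> Tuple[Dict[str, str], Dict[str, Any]]:
    """Separate untrusted scraped text from unwrapped pass-through values."""
    untrusted, passthrough = {}, {}
    for key, value in row.items():
        if is_scraped(value):
            untrusted[key] = value["content"]
        else:
            passthrough[key] = value
    return untrusted, passthrough


row = {
    "title": {"type": "scraped", "content": "Premium Widget XL",
              "origin": "https://...", "truncated": False},
    "price": 29.99,
    "in_stock": True,
    "_jobId": "65a1...",
}
untrusted, passthrough = split_row(row)
print(untrusted)    # {'title': 'Premium Widget XL'}
print(passthrough)  # {'price': 29.99, 'in_stock': True, '_jobId': '65a1...'}
```

Anything in `untrusted` should be treated as data only and never interpreted as instructions.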

Errors — 400 (group not found) / 401 / 402 (plan lacks EXTERNAL_API — fall back to GET /api/scraper/data/group/{id} with your own prompt-injection safety) / 403 (N/A) / 404 (N/A — 400 used) / 429 / 500.

Need just the URLs a scraper hit? There’s no dedicated per-scraper URL endpoint. Use GET /api/scraper/data/group/{id} with scraperId=<id> and project the URL field client-side (e.g. _sourceUrl).
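
The client-side projection can be as small as the sketch below, which collects distinct URLs in first-seen order. The helper name `source_urls` is hypothetical, and `_sourceUrl` is the metadata key from the group-data response example.

```python
from typing import Any, Dict, Iterable, List


def source_urls(rows: Iterable[Dict[str, Any]],
                url_field: str = "_sourceUrl") -> List[str]:
    """Return distinct URLs from the rows, preserving first-seen order."""
    urls = (row[url_field] for row in rows if row.get(url_field))
    return list(dict.fromkeys(urls))


rows = [
    {"_sourceUrl": "https://example.com/a", "title": "A"},
    {"_sourceUrl": "https://example.com/b", "title": "B"},
    {"_sourceUrl": "https://example.com/a", "title": "A again"},
    {"title": "no url recorded"},
]
print(source_urls(rows))
# ['https://example.com/a', 'https://example.com/b']
```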

Group-level Excel download — GET /api/scraper/data/group/{id}/download

Downloads the latest run’s data for the given group as a single .xlsx file (Content-Disposition: attachment; filename=data.xlsx). Filters + categories supported, same as the group-data read endpoint. Subject to the DATA_EXPORT plan feature.

curl -H "Authorization: Bearer $KEY" \
  "https://portal.scrapewise.ai/api/scraper-api/api/scraper/data/group/grp_xyz/download?categories=discounts" \
  --output data.xlsx

Query params

| Param | Description |
| --- | --- |
| categories | Comma-separated category names; restricts results to those saved filters |
| filters | URL-encoded JSON filter expression (one-off; doesn’t need a saved category) |
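
A Python equivalent of the curl download might look like the sketch below. The function names are hypothetical and the base URL is taken from the curl example. Category names are joined with commas as this table describes, and `save_xlsx` streams the response to disk instead of buffering the whole file in memory.

```python
import shutil
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Base URL copied from the curl example above.
BASE_URL = "https://portal.scrapewise.ai/api/scraper-api"


def download_url(group_id, categories=(), filters=None):
    """Build the Excel download URL (hypothetical helper)."""
    params = {}
    if categories:
        params["categories"] = ",".join(categories)
    if filters is not None:
        params["filters"] = filters  # raw JSON text; urlencode encodes it
    url = f"{BASE_URL}/api/scraper/data/group/{group_id}/download"
    return f"{url}?{urlencode(params)}" if params else url


def save_xlsx(group_id, api_key, dest="data.xlsx", **query):
    """Stream the .xlsx response to disk. Performs a real HTTP request."""
    request = Request(download_url(group_id, **query),
                      headers={"Authorization": f"Bearer {api_key}"})
    with urlopen(request) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)


print(download_url("grp_xyz", categories=["discounts"]))
```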

Errors — 400 (group not found) / 401 / 402 (plan lacks DATA_EXPORT) / 403 (N/A) / 404 (N/A — 400 used) / 429 / 500.

Saved-filter categories

A category is a named filter expression saved against a group — a reusable filter bundle the customer references when querying scraped data. Subject to the GROUP_CUSTOM_CATEGORIES plan feature; total categories per group capped by GROUP_CUSTOM_CATEGORIES_LIMIT.

List categories — GET /api/scraper/data/group/{id}/categories

curl -H "Authorization: Bearer $KEY" \
  https://portal.scrapewise.ai/api/scraper-api/api/scraper/data/group/grp_xyz/categories

Response (200): List<CustomerCategoryDTO>

[
  { "id": "cat_001", "name": "big-discounts", "filters": { /* FiltersDTO */ } },
  { "id": "cat_002", "name": "in-stock", "filters": { /* FiltersDTO */ } }
]

Errors — 400 (group not found) / 401 / 403 (N/A) / 404 (N/A — empty list returned if none) / 429 / 500.

Save a category — PUT /api/scraper/data/group/{id}/category

PUT /api/scraper/data/group/grp_xyz/category
Authorization: Bearer <key>
Idempotency-Key: <uuid>
Content-Type: application/json

{
  "name": "big-discounts",
  "filters": { "discountPct": { "gt": 20 } }
}

Categories are upserted by name within the group: saving again with the same name updates the existing category’s filter expression.

Response (200) — persisted CustomerCategoryDTO with assigned id.
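
The save request above could be assembled as in the sketch below. It only builds the request parts (method, URL, headers, body) so they can be handed to any HTTP client. The helper name is hypothetical, the base URL is assumed from the curl examples on this page, and generating the Idempotency-Key with `uuid4` is one reasonable choice rather than a documented requirement.

```python
import json
import uuid

# Assumption: same base URL as the curl examples on this page.
BASE_URL = "https://portal.scrapewise.ai/api/scraper-api"


def save_category_request(group_id, name, filters, api_key,
                          idempotency_key=None):
    """Return the parts of a category upsert request (hypothetical helper)."""
    return {
        "method": "PUT",
        "url": f"{BASE_URL}/api/scraper/data/group/{group_id}/category",
        "headers": {
            "Authorization": f"Bearer {api_key}",
            # A fresh UUID per logical save unless the caller supplies one.
            "Idempotency-Key": idempotency_key or str(uuid.uuid4()),
            "Content-Type": "application/json",
        },
        "body": json.dumps({"name": name, "filters": filters}),
    }


req = save_category_request(
    "grp_xyz", "big-discounts", {"discountPct": {"gt": 20}}, api_key="KEY")
print(req["method"], req["url"])
print(req["body"])
```

When retrying a save that may already have reached the server, passing the original key via `idempotency_key` is the usual way such headers are used.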

Errors — 400 (group not found) / 401 / 402 (plan lacks GROUP_CUSTOM_CATEGORIES OR GROUP_CUSTOM_CATEGORIES_LIMIT reached) / 403 (N/A) / 404 (N/A — 400 used) / 429 / 500.

Delete a category (destructive — two-call protocol) — DELETE /api/scraper/data/group/{id}/category/{categoryId}

Destructive operation. Deleting a category permanently removes it from the group. The category’s filter expression is unrecoverable — re-creating via PUT /api/scraper/data/group/{id}/category requires you to remember the filter shape.

ADR-012 two-call pattern: first preview, then commit within 5 minutes.

Steps

  1. POST /api/scraper/data/group/{id}/category/{categoryId}/preview-delete — mints a 5-minute token + preview summary.
  2. DELETE /api/scraper/data/group/{id}/category/{categoryId} — commits the delete (idempotency-key required).

Skipping the preview step deletes the category without confirmation.

Step 1 — Preview

POST /api/scraper/data/group/grp_xyz/category/cat_001/preview-delete
Authorization: Bearer <key>

Response (200): DestructivePreviewResponseDTO

{
  "token": "f3c2-...",
  "opName": "scrapewise_delete_scraper_data_group_category",
  "targetEntityId": "cat_001",
  "previewSummary": {
    "entityName": "cat_001",
    "entityType": "data_group_category",
    "cascadeCounts": {},
    "warnings": []
  }
}

Step 2 — Commit

DELETE /api/scraper/data/group/grp_xyz/category/cat_001
Authorization: Bearer <key>
Idempotency-Key: <uuid>

Response: 204 No Content.

Errors (both steps) — 400 (group not found OR no category with that categoryId in the group) / 401 / 403 (N/A) / 404 (N/A — 400 used) / 429 / 500.
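
The preview-then-commit flow can be sketched as below. The code mirrors only the two requests shown in this section. How the preview token is conveyed on the commit call is not shown here, so the sketch does not attempt it. `send` and `confirm` are hypothetical callables: `send(method, path, headers)` performs the HTTP call and returns parsed JSON (or None), and `confirm(summary)` decides whether to proceed.

```python
import time
import uuid
from typing import Any, Callable, Dict, Optional

TOKEN_TTL_SECONDS = 5 * 60  # "commit within 5 minutes"

Send = Callable[[str, str, Dict[str, str]], Optional[Dict[str, Any]]]


def delete_category(send: Send, group_id: str, category_id: str,
                    confirm: Callable[[Dict[str, Any]], bool],
                    api_key: str) -> bool:
    """Preview, ask for confirmation, then commit. Returns True if deleted."""
    path = f"/api/scraper/data/group/{group_id}/category/{category_id}"
    auth = {"Authorization": f"Bearer {api_key}"}

    # Start the clock before the preview call so the check is conservative.
    started = time.monotonic()
    preview = send("POST", f"{path}/preview-delete", auth)

    if not confirm(preview["previewSummary"]):
        return False
    if time.monotonic() - started > TOKEN_TTL_SECONDS:
        raise TimeoutError("preview expired; request a new preview")

    # NOTE: passing preview["token"] on commit is not shown in this section.
    send("DELETE", path, {**auth, "Idempotency-Key": str(uuid.uuid4())})
    return True
```

Injecting `send` keeps the flow testable without network access and leaves the choice of HTTP client to the caller.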

JSON export (programmatic)

For programmatic use, the paginated GET /api/scraper/data/group/{id} endpoint is usually what you want. Excel export is meant for handing files to humans.

Common errors

| Code | Meaning |
| --- | --- |
| scraper_not_found | scraperId doesn’t exist for your customer |
| invalid_filter | Filter syntax is malformed |
| export_too_large | Dataset exceeds 100k rows for xlsx format; use jsonl or paginate manually |
