Scraped data
Once a scraper has run, the rows it produced are accessible through these endpoints.
Read group data — GET /api/scraper/data/group/{id}
The primary scraped-data read. Returns a paginated page of rows for the group, optionally restricted by saved categories or an inline filter expression. isLastRun=true narrows to the most recent run’s rows only.
curl -H "Authorization: Bearer $KEY" \
"https://portal.scrapewise.ai/api/scraper-api/api/scraper/data/group/grp_xyz?page=0&size=100&isLastRun=true"
Query params
| Param | Type | Description |
|---|---|---|
| page | int | Zero-indexed page number (default 0) |
| size | int | Page size (default 20, capped server-side) |
| sort | string | Spring-style, e.g. scrapedAt,desc |
| categories | string[] | Repeatable — restrict to these saved-category names. See Saved-filter categories. |
| filters | string | URL-encoded JSON Mongo-style filter expression |
| isLastRun | boolean | If true, only the most recent run’s rows |
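As a rough illustration of how the query params in the table combine, here is a minimal Python sketch that builds the request URL. The base URL is copied from the curl example; the function name `group_data_url` and its keyword arguments are hypothetical, not part of any official SDK.

```python
from urllib.parse import urlencode

# Base URL copied from the curl example; adjust if your deployment differs.
BASE_URL = "https://portal.scrapewise.ai/api/scraper-api"


def group_data_url(group_id, page=0, size=20, sort=None,
                   categories=(), filters=None, is_last_run=None):
    """Build the GET URL for one page of a group's scraped rows (hypothetical helper)."""
    params = [("page", page), ("size", size)]
    if sort:
        params.append(("sort", sort))
    # `categories` is documented as repeatable: emit one key=value pair per name.
    params.extend(("categories", name) for name in categories)
    if filters is not None:
        # Pass the JSON text as-is; urlencode percent-encodes it.
        params.append(("filters", filters))
    if is_last_run is not None:
        params.append(("isLastRun", "true" if is_last_run else "false"))
    return f"{BASE_URL}/api/scraper/data/group/{group_id}?{urlencode(params)}"


print(group_data_url("grp_xyz", size=100, is_last_run=True))
```

Called with the values from the curl example, the helper yields the same URL; the request still needs the `Authorization: Bearer` header.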
Response
Spring Page<Document> — each content row is a dynamic Mongo Document keyed by the scraper’s configured field names. Numbers / booleans / null pass through; _-prefixed keys are server metadata (_sourceUrl, _scrapedAt, _scraperJobStatusId).
{
"content": [
{
"_sourceUrl": "https://...",
"_scrapedAt": "2026-05-19T...",
"_scraperJobStatusId": "65a1...",
"title": "Premium Widget XL",
"price": 29.99,
"in_stock": true
}
],
"totalElements": 1847,
"totalPages": 19,
"size": 100,
"number": 0
}
For agent / MCP use, choose the sanitized envelope variant instead: every string field comes pre-wrapped with origin metadata for prompt-injection safety.
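One way to walk the paginated response is sketched below in Python. It relies only on the `content` and `totalPages` keys shown in the sample body. The `fetch_page` callable is an assumption: it stands in for whatever HTTP client you use and should return the parsed JSON body for a given page number.

```python
from typing import Callable, Iterator


def iter_rows(fetch_page: Callable[[int], dict]) -> Iterator[dict]:
    """Yield every row across pages; `fetch_page(n)` returns one parsed Page body."""
    page = 0
    while True:
        body = fetch_page(page)
        yield from body.get("content", [])
        page += 1
        # Stop once the next page index reaches the reported page count.
        if page >= body.get("totalPages", 0):
            break


def split_metadata(row: dict) -> tuple[dict, dict]:
    """Separate scraper-defined fields from `_`-prefixed server metadata."""
    fields = {k: v for k, v in row.items() if not k.startswith("_")}
    meta = {k: v for k, v in row.items() if k.startswith("_")}
    return fields, meta
```

Because rows are dynamic documents keyed by the scraper's field names, splitting on the `_` prefix is a convenient way to keep server metadata apart from scraped values.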
Field-level filtering
The filters query param accepts a URL-encoded Mongo-style filter expression, for example:
?filters=%7B%22price%22%3A%7B%22%24gt%22%3A50%7D%2C%22in_stock%22%3Atrue%7D
This decodes to {"price":{"$gt":50},"in_stock":true}. The $and / $or / $nor logical groups are supported and can be nested recursively.
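Hand-encoding the filter is error-prone, so a small Python sketch of the encoding step may help. It uses only the standard library; `encode_filters` is a hypothetical helper name.

```python
import json
from urllib.parse import quote, unquote


def encode_filters(expr: dict) -> str:
    """Serialize a filter expression compactly and percent-encode it."""
    compact = json.dumps(expr, separators=(",", ":"))
    # safe="" forces every reserved character (braces, quotes, $, :) to be escaped.
    return quote(compact, safe="")


expr = {"price": {"$gt": 50}, "in_stock": True}
encoded = encode_filters(expr)
print("?filters=" + encoded)

# Round trip: decoding gives back the same expression.
assert json.loads(unquote(encoded)) == expr

# A logical group nests ordinary conditions inside a list.
either = {"$or": [{"price": {"$gt": 50}}, {"in_stock": True}]}
print("?filters=" + encode_filters(either))
```

The first print reproduces the query string shown above, which is a quick way to check that your encoder matches what the endpoint expects.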
Errors — 400 (group not found / malformed filters JSON) / 401 / 403 (N/A) / 404 (N/A — 400 envelope used) / 429 / 500.
Delete scraped data
The legacy single-call DELETE /api/scraper/data?scraperId=...|?id=...|?jobId=... shape is no longer wired. The only DELETE /api/scraper/data endpoint that ships today takes ?scraperJobStatusId=... and follows the destructive two-call protocol described on the Scraper Jobs page (preview-delete, then commit). For broader removals (an entire scraper or an entire group, optionally including its data), use the scraper-level or group-level endpoints:
| Goal | Endpoint |
|---|---|
| Delete one run’s rows | DELETE /api/scraper/data?scraperJobStatusId=... (two-call) |
| Delete one scraper (optionally its data) | DELETE /api/scraper/{id}?withData=true (two-call) |
| Delete one group (optionally its data) | DELETE /api/scraper/group/{id}?withData=true (two-call) |
Agent-friendly read (prompt-injection envelope) — GET /api/scraper/data/group/{id}/client
Same paginated data as GET /api/scraper/data/group/{id} — but when called from the MCP gateway, every string field is automatically wrapped in a {type: "scraped", content, origin, truncated} envelope. The agent’s system prompt instructs it never to act on instructions found inside scraped content (prompt-injection protection).
Numbers, booleans, null, and _-prefixed metadata pass through unwrapped (they weren’t scraped from arbitrary HTML).
The sanitized query param is hidden from the OpenAPI surface — the MCP gateway forces it server-side; direct SDK consumers can opt in by setting it explicitly.
Requires the EXTERNAL_API plan feature.
curl -H "Authorization: Bearer $KEY" \
"https://portal.scrapewise.ai/api/scraper-api/api/scraper/data/group/grp_xyz/client?page=0&size=50"
Response (200) — Spring Page of sanitized rows:
{
"content": [
{
"title": { "type": "scraped", "content": "Premium Widget XL", "origin": "https://...", "truncated": false },
"price": 29.99,
"in_stock": true,
"_jobId": "65a1..."
}
],
"totalElements": 1847,
"totalPages": 37
}
Errors — 400 (group not found) / 401 / 402 (plan lacks EXTERNAL_API — fall back to GET /api/scraper/data/group/{id} with your own prompt-injection safety) / 403 (N/A) / 404 (N/A — 400 used) / 429 / 500.
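For code that needs the plain values, for example to store or display a row, the envelope can be unwrapped with a helper like the following Python sketch. The function names are hypothetical. Unwrapping does not make the text trustworthy, so the result should still be treated as untrusted scraped content.

```python
def is_envelope(value) -> bool:
    """True if `value` looks like the scraped-string envelope in the sample response."""
    return (isinstance(value, dict)
            and value.get("type") == "scraped"
            and "content" in value)


def flatten_row(row: dict) -> dict:
    """Replace each envelope with its raw text; leave other values untouched."""
    return {k: (v["content"] if is_envelope(v) else v) for k, v in row.items()}


def truncated_fields(row: dict) -> list:
    """Names of fields whose envelope reports truncated content."""
    return [k for k, v in row.items() if is_envelope(v) and v.get("truncated")]


row = {
    "title": {"type": "scraped", "content": "Premium Widget XL",
              "origin": "https://...", "truncated": False},
    "price": 29.99,
    "in_stock": True,
    "_jobId": "65a1...",
}
print(flatten_row(row))
```

Numbers, booleans, null, and `_`-prefixed metadata are returned unchanged, mirroring how the endpoint leaves them unwrapped.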
Need just the URLs a scraper hit? There’s no dedicated per-scraper URL endpoint. Use GET /api/scraper/data/group/{id} with scraperId=<id> and project the URL field client-side (e.g. _sourceUrl).
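The client-side projection mentioned above could look like this in Python. It is a sketch that assumes `rows` is the `content` list from one or more pages of the read endpoint; `source_urls` is a hypothetical name.

```python
def source_urls(rows, url_field="_sourceUrl"):
    """Return the distinct URLs in first-seen order, skipping rows without one."""
    seen = {}
    for row in rows:
        url = row.get(url_field)
        if url:
            # dict keys keep insertion order, which gives an ordered de-duplication.
            seen.setdefault(url, None)
    return list(seen)


rows = [
    {"_sourceUrl": "https://example.com/a", "title": "A"},
    {"_sourceUrl": "https://example.com/b", "title": "B"},
    {"_sourceUrl": "https://example.com/a", "title": "A (variant)"},
    {"title": "no url recorded"},
]
print(source_urls(rows))
```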
Group-level Excel download — GET /api/scraper/data/group/{id}/download
Downloads the latest run’s data for the given group as a single .xlsx file (Content-Disposition: attachment; filename=data.xlsx). Filters + categories supported, same as the group-data read endpoint. Subject to the DATA_EXPORT plan feature.
curl -H "Authorization: Bearer $KEY" \
"https://portal.scrapewise.ai/api/scraper-api/api/scraper/data/group/grp_xyz/download?categories=discounts" \
--output data.xlsx
Query params
| Param | Description |
|---|---|
| categories | Comma-separated category names — restrict to those saved filters |
| filters | URL-encoded JSON filter expression (one-off; doesn’t need a saved category) |
Errors — 400 (group not found) / 401 / 402 (plan lacks DATA_EXPORT) / 403 (N/A) / 404 (N/A — 400 used) / 429 / 500.
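A minimal Python sketch of the download call follows, using only the standard library. Note that this endpoint documents `categories` as a single comma-separated value. The helper names and the default destination path are assumptions for illustration.

```python
import shutil
import urllib.request
from urllib.parse import urlencode

# Base URL copied from the curl example.
BASE_URL = "https://portal.scrapewise.ai/api/scraper-api"


def download_url(group_id, categories=(), filters=None):
    """Build the download URL; `filters` is the JSON text of a filter expression."""
    params = {}
    if categories:
        params["categories"] = ",".join(categories)
    if filters is not None:
        params["filters"] = filters
    url = f"{BASE_URL}/api/scraper/data/group/{group_id}/download"
    # safe="," keeps the category separator unescaped in the query string.
    return f"{url}?{urlencode(params, safe=',')}" if params else url


def download_xlsx(url, api_key, dest="data.xlsx"):
    """Stream the workbook to disk; urlopen raises HTTPError on 4xx/5xx responses."""
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})
    with urllib.request.urlopen(req) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out)
    return dest


print(download_url("grp_xyz", categories=["discounts"]))
```

Streaming with `shutil.copyfileobj` avoids holding the whole file in memory, which matters for larger exports.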
Saved-filter categories
A category is a named filter expression saved against a group — a reusable filter bundle the customer references when querying scraped data. Subject to the GROUP_CUSTOM_CATEGORIES plan feature; total categories per group capped by GROUP_CUSTOM_CATEGORIES_LIMIT.
List categories — GET /api/scraper/data/group/{id}/categories
curl -H "Authorization: Bearer $KEY" \
https://portal.scrapewise.ai/api/scraper-api/api/scraper/data/group/grp_xyz/categories
Response (200) — List<CustomerCategoryDTO>:
[
{ "id": "cat_001", "name": "big-discounts", "filters": { /* FiltersDTO */ } },
{ "id": "cat_002", "name": "in-stock", "filters": { /* FiltersDTO */ } }
]
Errors — 400 (group not found) / 401 / 403 (N/A) / 404 (N/A — empty list returned if none) / 429 / 500.
Save a category — PUT /api/scraper/data/group/{id}/category
PUT /api/scraper/data/group/grp_xyz/category
Authorization: Bearer <key>
Idempotency-Key: <uuid>
Content-Type: application/json
{
"name": "big-discounts",
"filters": { "discountPct": { "gt": 20 } }
}
Upsert by name within the group (re-saving with the same name updates the filter expression).
Response (200) — persisted CustomerCategoryDTO with assigned id.
Errors — 400 (group not found) / 401 / 402 (plan lacks GROUP_CUSTOM_CATEGORIES OR GROUP_CUSTOM_CATEGORIES_LIMIT reached) / 403 (N/A) / 404 (N/A — 400 used) / 429 / 500.
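The PUT request above can be assembled in Python roughly as follows. This sketch only builds the request object; sending it is left to `urllib.request.urlopen` or your own client. The helper name is hypothetical, and the idea of reusing one Idempotency-Key across retries of the same logical save is the usual convention for such headers rather than something this page spells out.

```python
import json
import urllib.request
import uuid

# Base URL copied from the curl examples.
BASE_URL = "https://portal.scrapewise.ai/api/scraper-api"


def save_category_request(group_id, name, filters, api_key, idempotency_key=None):
    """Build (but do not send) the PUT request that upserts a category by name."""
    body = json.dumps({"name": name, "filters": filters}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/api/scraper/data/group/{group_id}/category",
        data=body,
        method="PUT",
        headers={
            "Authorization": f"Bearer {api_key}",
            # Generate one key per logical save; pass it back in when retrying.
            "Idempotency-Key": idempotency_key or str(uuid.uuid4()),
            "Content-Type": "application/json",
        },
    )


req = save_category_request(
    "grp_xyz", "big-discounts", {"discountPct": {"gt": 20}}, api_key="KEY",
)
print(req.get_method(), req.full_url)
```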
Delete a category (destructive — two-call protocol) — DELETE /api/scraper/data/group/{id}/category/{categoryId}
Destructive operation. Deleting a category permanently removes it from the group. The category’s filter expression is unrecoverable — re-creating via PUT /api/scraper/data/group/{id}/category requires you to remember the filter shape.
ADR-012 two-call pattern: first preview, then commit within 5 minutes.
Steps
1. POST /api/scraper/data/group/{id}/category/{categoryId}/preview-delete mints a 5-minute token and a preview summary.
2. DELETE /api/scraper/data/group/{id}/category/{categoryId} commits the delete (Idempotency-Key required).
Skipping the preview step deletes the category without confirmation.
Step 1 — Preview
POST /api/scraper/data/group/grp_xyz/category/cat_001/preview-delete
Authorization: Bearer <key>
Response (200) — DestructivePreviewResponseDTO:
{
"token": "f3c2-...",
"opName": "scrapewise_delete_scraper_data_group_category",
"targetEntityId": "cat_001",
"previewSummary": {
"entityName": "cat_001",
"entityType": "data_group_category",
"cascadeCounts": {},
"warnings": []
}
}
Step 2 — Commit
DELETE /api/scraper/data/group/grp_xyz/category/cat_001
Authorization: Bearer <key>
Idempotency-Key: <uuid>
Response — 204 No Content.
Errors (both steps) — 400 (group not found OR no category with that categoryId in the group) / 401 / 403 (N/A) / 404 (N/A — 400 used) / 429 / 500.
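The two-call flow can be sketched in Python as below. The `send(method, path, headers)` callable is an assumption standing in for your HTTP client; it should return the parsed JSON body, or None for a 204. This page does not show how the preview token is conveyed to the commit call, so the sketch deliberately does not send it and only enforces the 5-minute window locally.

```python
import time
import uuid

TOKEN_TTL_SECONDS = 5 * 60  # the preview token is documented as valid for 5 minutes


def delete_category(send, group_id, category_id, confirm):
    """Preview, ask `confirm(summary)`, then commit. Returns True if deleted."""
    path = f"/api/scraper/data/group/{group_id}/category/{category_id}"
    started = time.monotonic()

    # Step 1: the preview mints the token and returns a summary to inspect.
    preview = send("POST", path + "/preview-delete", {})
    if not confirm(preview["previewSummary"]):
        return False

    if time.monotonic() - started > TOKEN_TTL_SECONDS:
        raise TimeoutError("preview token expired; run the preview again")

    # Step 2: commit. How the token travels with this call is not shown in the
    # docs above, so it is intentionally omitted from this sketch.
    send("DELETE", path, {"Idempotency-Key": str(uuid.uuid4())})
    return True
```

A caller might pass `confirm=lambda summary: not summary["warnings"]` to abort automatically whenever the preview reports warnings.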
JSON export (programmatic)
For your code, the paginated GET /api/scraper/data/group/{id} endpoint is usually what you want. Excel export is for handing files to humans.
Common errors
| Code | Meaning |
|---|---|
| scraper_not_found | scraperId doesn’t exist for your customer |
| invalid_filter | filter syntax is malformed |
| export_too_large | Dataset exceeds 100k rows for xlsx format — use jsonl or paginate manually |