Scraped data

Once a scraper has run, the rows it produced are accessible through these endpoints.

Read group data — `GET /api/scraper/data/group/{id}`

The primary scraped-data read. Returns a paginated page of rows for the group, optionally restricted by saved categories or an inline filter expression. isLastRun=true narrows to the most recent run’s rows only.


curl -H "Authorization: Bearer $KEY" \
  "https://portal.scrapewise.ai/api/scraper-api/api/scraper/data/group/grp_xyz?page=0&size=100&isLastRun=true"

Query params

Param	Type	Description
`page`	int	Zero-indexed page number (default 0)
`size`	int	Page size (default 20, capped server-side)
`sort`	string	Spring-style, e.g. `scrapedAt,desc`
`categories`	string[]	Repeatable — restrict to these saved-category names. See Saved-filter categories.
`filters`	string	URL-encoded JSON Mongo-style filter expression
`isLastRun`	boolean	If `true`, only the most recent run’s rows
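
The query parameters above can be assembled with the standard library. This is a minimal sketch, not an official client: the helper name `group_data_url` and the category names are illustrative assumptions, while the base URL and parameter names are taken from the curl example and the table above.

```python
from urllib.parse import urlencode

# Base URL as it appears in the curl example above.
BASE = "https://portal.scrapewise.ai/api/scraper-api"


def group_data_url(group_id, page=0, size=20, sort=None, categories=(), is_last_run=False):
    """Build the read URL for one page of a group's rows (hypothetical helper)."""
    params = [("page", page), ("size", size)]
    if sort:
        params.append(("sort", sort))
    # `categories` is repeatable, so each name becomes its own query pair.
    params.extend(("categories", name) for name in categories)
    if is_last_run:
        params.append(("isLastRun", "true"))
    return f"{BASE}/api/scraper/data/group/{group_id}?{urlencode(params)}"


url = group_data_url(
    "grp_xyz",
    size=100,
    sort="scrapedAt,desc",
    categories=["big-discounts", "in-stock"],  # example names only
    is_last_run=True,
)
print(url)
```

Note that `urlencode` percent-encodes the comma in the `sort` value; the server is expected to decode it like any other query string.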

Response

Spring Page<Document> — each content row is a dynamic Mongo Document keyed by the scraper’s configured field names. Numbers / booleans / null pass through; _-prefixed keys are server metadata (_sourceUrl, _scrapedAt, _scraperJobStatusId).


{
  "content": [
    {
      "_sourceUrl": "https://...",
      "_scrapedAt": "2026-05-19T...",
      "_scraperJobStatusId": "65a1...",
      "title": "Premium Widget XL",
      "price": 29.99,
      "in_stock": true
    }
  ],
  "totalElements": 1847,
  "totalPages": 19,
  "size": 100,
  "number": 0
}
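
Because server metadata is distinguished only by the leading underscore, a consumer may want to separate it from the scraper's own fields. A small sketch, assuming rows have already been parsed from JSON into dicts (`split_row` is an illustrative name, not part of the API):

```python
def split_row(row):
    """Separate server metadata (keys starting with "_") from scraped fields."""
    meta = {k: v for k, v in row.items() if k.startswith("_")}
    fields = {k: v for k, v in row.items() if not k.startswith("_")}
    return meta, fields


# Row copied from the sample response above (placeholder values kept as-is).
row = {
    "_sourceUrl": "https://...",
    "_scrapedAt": "2026-05-19T...",
    "_scraperJobStatusId": "65a1...",
    "title": "Premium Widget XL",
    "price": 29.99,
    "in_stock": True,
}

meta, fields = split_row(row)
print(sorted(meta))
print(fields)
```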

For agent or MCP use, call the sanitized envelope variant instead (GET /api/scraper/data/group/{id}/client): every string field comes pre-wrapped with origin metadata for prompt-injection safety.

Field-level filtering

The filters query param accepts a URL-encoded, Mongo-style filter expression. Field conditions use standard comparison operators such as $gt:


?filters=%7B%22price%22%3A%7B%22%24gt%22%3A50%7D%2C%22in_stock%22%3Atrue%7D

This decodes to {"price":{"$gt":50},"in_stock":true}. Logical groups ($and, $or, $nor) are also supported and can be nested recursively.
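
Producing the encoded value by hand is error-prone, so here is a minimal sketch of one way to generate it. The helper name `encode_filters` is an assumption, and the nested `$or` expression is only an illustration of the grouping syntax, not a filter taken from the API documentation.

```python
import json
from urllib.parse import quote, unquote


def encode_filters(expr):
    """Serialize a filter expression compactly and percent-encode every reserved character."""
    compact = json.dumps(expr, separators=(",", ":"))
    return quote(compact, safe="")


expr = {"price": {"$gt": 50}, "in_stock": True}
encoded = encode_filters(expr)
print("?filters=" + encoded)

# Decoding returns the original expression.
assert json.loads(unquote(encoded)) == expr

# Illustrative logical group: either condition may match.
either = {"$or": [{"price": {"$gt": 50}}, {"in_stock": True}]}
print("?filters=" + encode_filters(either))
```

For the first expression this yields the same encoded string as the example above.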

Errors — 400 (group not found / malformed filters JSON) / 401 / 403 (N/A) / 404 (N/A — 400 envelope used) / 429 / 500.

Delete scraped data

The legacy single-call DELETE /api/scraper/data?scraperId=...|?id=...|?jobId=... shape is no longer wired. The only DELETE /api/scraper/data endpoint that ships today takes ?scraperJobStatusId=... and follows the destructive two-call protocol described on the Scraper Jobs page (preview-delete, then commit). The table below lists that call alongside the endpoints for broader removals (an entire scraper or an entire group, optionally including its data):

Goal	Endpoint
Delete one run’s rows	`DELETE /api/scraper/data?scraperJobStatusId=...` (two-call)
Delete one scraper (optionally its data)	`DELETE /api/scraper/{id}?withData=true` (two-call)
Delete one group (optionally its data)	`DELETE /api/scraper/group/{id}?withData=true` (two-call)

Agent-friendly read (prompt-injection envelope) — `GET /api/scraper/data/group/{id}/client`

Same paginated data as GET /api/scraper/data/group/{id} — but when called from the MCP gateway, every string field is automatically wrapped in a {type: "scraped", content, origin, truncated} envelope. The agent’s system prompt instructs it never to act on instructions found inside scraped content (prompt-injection protection).

Numbers, booleans, null, and _-prefixed metadata pass through unwrapped (they weren’t scraped from arbitrary HTML).

The sanitized query param is hidden from the OpenAPI surface — the MCP gateway forces it server-side; direct SDK consumers can opt in by setting it explicitly.

Requires the EXTERNAL_API plan feature.


curl -H "Authorization: Bearer $KEY" \
  "https://portal.scrapewise.ai/api/scraper-api/api/scraper/data/group/grp_xyz/client?page=0&size=50"

Response (200) — Spring Page of sanitized rows:


{
  "content": [
    {
      "title":    { "type": "scraped", "content": "Premium Widget XL", "origin": "https://...", "truncated": false },
      "price":    29.99,
      "in_stock": true,
      "_jobId":   "65a1..."
    }
  ],
  "totalElements": 1847,
  "totalPages": 37
}
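
A consumer of the sanitized rows typically needs to tell wrapped scraped text apart from pass-through values. The sketch below assumes the envelope shape shown above and uses an illustrative helper name (`partition_row`); it keeps the envelopes intact so that downstream code can continue to treat their content as untrusted.

```python
def partition_row(row):
    """Split a sanitized row into pass-through values and scraped-text envelopes."""
    trusted, untrusted = {}, {}
    for key, value in row.items():
        if isinstance(value, dict) and value.get("type") == "scraped":
            untrusted[key] = value  # keep origin and truncated alongside content
        else:
            trusted[key] = value
    return trusted, untrusted


# Row copied from the sample response above.
row = {
    "title": {"type": "scraped", "content": "Premium Widget XL",
              "origin": "https://...", "truncated": False},
    "price": 29.99,
    "in_stock": True,
    "_jobId": "65a1...",
}

trusted, untrusted = partition_row(row)
print(trusted)
print({name: env["content"] for name, env in untrusted.items()})
```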

Errors — 400 (group not found) / 401 / 402 (plan lacks EXTERNAL_API — fall back to GET /api/scraper/data/group/{id} with your own prompt-injection safety) / 403 (N/A) / 404 (N/A — 400 used) / 429 / 500.

Need just the URLs a scraper hit? There’s no dedicated per-scraper URL endpoint. Use GET /api/scraper/data/group/{id} with scraperId=<id> and project the URL field client-side (e.g. _sourceUrl).
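
The client-side projection can be as small as the following sketch. It assumes the rows were already fetched and parsed, that each row carries `_sourceUrl`, and that duplicates should be collapsed; the function name and the example.com URLs are placeholders.

```python
def source_urls(rows):
    """Collect distinct _sourceUrl values, preserving first-seen order."""
    seen = {}
    for row in rows:
        url = row.get("_sourceUrl")
        if url is not None:
            seen.setdefault(url, None)
    return list(seen)


# Placeholder rows standing in for the `content` array of a response page.
rows = [
    {"_sourceUrl": "https://example.com/a", "title": "A"},
    {"_sourceUrl": "https://example.com/b", "title": "B"},
    {"_sourceUrl": "https://example.com/a", "title": "A again"},
]
print(source_urls(rows))
```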

Group-level Excel download — `GET /api/scraper/data/group/{id}/download`

Downloads the latest run’s data for the given group as a single .xlsx file (Content-Disposition: attachment; filename=data.xlsx). Filters + categories supported, same as the group-data read endpoint. Subject to the DATA_EXPORT plan feature.


curl -H "Authorization: Bearer $KEY" \
  "https://portal.scrapewise.ai/api/scraper-api/api/scraper/data/group/grp_xyz/download?categories=discounts" \
  --output data.xlsx

Query params

Param	Description
`categories`	Comma-separated category names — restrict to those saved filters
`filters`	URL-encoded JSON filter expression (one-off; doesn’t need a saved category)
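
The same download can be scripted without curl. This is a sketch under stated assumptions: `download_url` and `save_xlsx` are hypothetical helper names, the API key is supplied by the caller, and `save_xlsx` is defined but not invoked here because it performs a real network request.

```python
import json
import shutil
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Base URL as it appears in the curl example above.
BASE = "https://portal.scrapewise.ai/api/scraper-api"


def download_url(group_id, categories=(), filters=None):
    """Build the Excel download URL; category names are joined with commas."""
    params = []
    if categories:
        params.append(("categories", ",".join(categories)))
    if filters is not None:
        params.append(("filters", json.dumps(filters, separators=(",", ":"))))
    query = urlencode(params)
    url = f"{BASE}/api/scraper/data/group/{group_id}/download"
    return f"{url}?{query}" if query else url


def save_xlsx(url, api_key, path="data.xlsx"):
    """Stream the response body to disk (makes a network call when invoked)."""
    request = Request(url, headers={"Authorization": f"Bearer {api_key}"})
    with urlopen(request) as response, open(path, "wb") as out:
        shutil.copyfileobj(response, out)


print(download_url("grp_xyz", categories=["discounts"]))
```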

Errors — 400 (group not found) / 401 / 402 (plan lacks DATA_EXPORT) / 403 (N/A) / 404 (N/A — 400 used) / 429 / 500.

Saved-filter categories

A category is a named filter expression saved against a group — a reusable filter bundle the customer references when querying scraped data. Subject to the GROUP_CUSTOM_CATEGORIES plan feature; total categories per group capped by GROUP_CUSTOM_CATEGORIES_LIMIT.

List categories — `GET /api/scraper/data/group/{id}/categories`


curl -H "Authorization: Bearer $KEY" \
  https://portal.scrapewise.ai/api/scraper-api/api/scraper/data/group/grp_xyz/categories

Response (200) — List<CustomerCategoryDTO>:


[
  { "id": "cat_001", "name": "big-discounts", "filters": { /* FiltersDTO */ } },
  { "id": "cat_002", "name": "in-stock",      "filters": { /* FiltersDTO */ } }
]

Errors — 400 (group not found) / 401 / 403 (N/A) / 404 (N/A — empty list returned if none) / 429 / 500.

Save a category — `PUT /api/scraper/data/group/{id}/category`


PUT /api/scraper/data/group/grp_xyz/category
Authorization: Bearer <key>
Idempotency-Key: <uuid>
Content-Type: application/json
 
{
  "name": "big-discounts",
  "filters": { "discountPct": { "gt": 20 } }
}

Upserts by name within the group: saving again with an existing name updates that category's filter expression.
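
The request above could be built in code roughly as follows. This is a sketch only: `save_category_request` is an illustrative name, the base URL is the one used in the curl examples elsewhere on this page, and the request object is constructed without being sent.

```python
import json
import uuid
from urllib.request import Request

# Base URL as it appears in the curl examples on this page.
BASE = "https://portal.scrapewise.ai/api/scraper-api"


def save_category_request(group_id, api_key, name, filters):
    """Build (but do not send) the PUT that upserts a category by name."""
    body = json.dumps({"name": name, "filters": filters}).encode("utf-8")
    return Request(
        f"{BASE}/api/scraper/data/group/{group_id}/category",
        data=body,
        method="PUT",
        headers={
            "Authorization": f"Bearer {api_key}",
            # A fresh UUID per logical save, matching the header shown above.
            "Idempotency-Key": str(uuid.uuid4()),
            "Content-Type": "application/json",
        },
    )


req = save_category_request(
    "grp_xyz", "YOUR_KEY", "big-discounts", {"discountPct": {"gt": 20}}
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing the resulting object to `urllib.request.urlopen` would send it. If a save is retried after a transport failure, reusing the original Idempotency-Key rather than minting a new one is the usual convention for such headers.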

Response (200) — persisted CustomerCategoryDTO with assigned id.

Errors — 400 (group not found) / 401 / 402 (plan lacks GROUP_CUSTOM_CATEGORIES OR GROUP_CUSTOM_CATEGORIES_LIMIT reached) / 403 (N/A) / 404 (N/A — 400 used) / 429 / 500.

Delete a category (destructive — two-call protocol) — `DELETE /api/scraper/data/group/{id}/category/{categoryId}`

Destructive operation. Deleting a category permanently removes it from the group. The category’s filter expression is unrecoverable — re-creating via PUT /api/scraper/data/group/{id}/category requires you to remember the filter shape.

ADR-012 two-call pattern: first preview, then commit within 5 minutes.

Steps

1. POST /api/scraper/data/group/{id}/category/{categoryId}/preview-delete: mints a 5-minute token and returns a preview summary.
2. DELETE /api/scraper/data/group/{id}/category/{categoryId}: commits the delete (Idempotency-Key header required).

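Always run the preview first: the two-call protocol expects the commit to follow a preview made within the previous 5 minutes. The delete removes the category itself, not scraped rows.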

Step 1 — Preview


POST /api/scraper/data/group/grp_xyz/category/cat_001/preview-delete
Authorization: Bearer <key>

Response (200) — DestructivePreviewResponseDTO:


{
  "token": "f3c2-...",
  "opName": "scrapewise_delete_scraper_data_group_category",
  "targetEntityId": "cat_001",
  "previewSummary": {
    "entityName": "cat_001",
    "entityType": "data_group_category",
    "cascadeCounts": {},
    "warnings": []
  }
}

Step 2 — Commit


DELETE /api/scraper/data/group/grp_xyz/category/cat_001
Authorization: Bearer <key>
Idempotency-Key: <uuid>

Response — 204 No Content.
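
The two calls can be sketched as a pair of request builders plus a local check on the 5-minute window. All helper names are assumptions. The requests mirror the samples above exactly, so the commit carries no token: this page does not show how the preview token is presented at commit time, and the sketch does not guess.

```python
import uuid
from urllib.request import Request

# Base URL as it appears in the curl examples on this page.
BASE = "https://portal.scrapewise.ai/api/scraper-api"
TOKEN_TTL_SECONDS = 5 * 60  # the preview token is documented as a 5-minute token


def _category_url(group_id, category_id):
    return f"{BASE}/api/scraper/data/group/{group_id}/category/{category_id}"


def preview_request(group_id, category_id, api_key):
    """Step 1: ask the server for a preview summary and token."""
    return Request(
        _category_url(group_id, category_id) + "/preview-delete",
        method="POST",
        headers={"Authorization": f"Bearer {api_key}"},
    )


def commit_request(group_id, category_id, api_key):
    """Step 2: commit the delete, with the required Idempotency-Key header."""
    return Request(
        _category_url(group_id, category_id),
        method="DELETE",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Idempotency-Key": str(uuid.uuid4()),
        },
    )


def within_window(previewed_at, now):
    """True while a preview taken at `previewed_at` (seconds) is still usable."""
    return 0 <= now - previewed_at < TOKEN_TTL_SECONDS


step1 = preview_request("grp_xyz", "cat_001", "YOUR_KEY")
step2 = commit_request("grp_xyz", "cat_001", "YOUR_KEY")
print(step1.get_method(), step1.full_url)
print(step2.get_method(), step2.full_url)
```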

Errors (both steps) — 400 (group not found OR no category with that categoryId in the group) / 401 / 403 (N/A) / 404 (N/A — 400 used) / 429 / 500.

JSON export (programmatic)

For your code, the paginated GET /api/scraper/data/group/{id} endpoint is usually what you want. Excel export is for handing files to humans.
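
Walking every page of a paginated read can be sketched as below. The `fetch_page` callable is a hypothetical stand-in for whatever performs the authenticated GET and returns the parsed response body; a fake implementation is used here so the example runs offline. Only the `content` and `totalPages` fields of the response are relied on.

```python
def iter_rows(fetch_page, size=100):
    """Yield every row, requesting zero-indexed pages until totalPages is reached."""
    page = 0
    while True:
        body = fetch_page(page, size)
        yield from body["content"]
        page += 1
        if page >= body["totalPages"]:
            break


def fake_fetch(page, size):
    """Offline stand-in that serves five rows in the response's page shape."""
    data = [{"n": n} for n in range(5)]
    total_pages = -(-len(data) // size)  # ceiling division
    return {
        "content": data[page * size:(page + 1) * size],
        "totalPages": total_pages,
        "number": page,
    }


rows = list(iter_rows(fake_fetch, size=2))
print(len(rows), rows[0], rows[-1])
```

Because this is a generator, rows can be processed as they arrive rather than holding the whole dataset in memory.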

Common errors

Code	Meaning
`scraper_not_found`	`scraperId` doesn’t exist for your customer
`invalid_filter`	`filters` syntax is malformed
`export_too_large`	Dataset exceeds 100k rows for `xlsx` format — use `jsonl` or paginate manually

What’s next

Run a scraper to produce data → Scrapers
API key management → API keys
Full reference → Reference