Scraper jobs
A job (technically scraperJobStatus) is one execution attempt of a scraper. Each job carries its state (PENDING, RUNNING, COMPLETED, FAILED, STOPPED), timestamps, link counts, and any per-link errors. This page covers:
- Reading the paginated job history (filterable by group / scraper / state / date).
- Drilling into a specific job’s failed-link list.
- Merging multiple jobs’ results into an enriched dataset (DATA_ENRICHMENT feature).
- Destructive: deleting the scraped rows produced by a single job.
Endpoint summary
| Method | Path | Operation ID | Auth scope |
|---|---|---|---|
| GET | /api/scraper/load-history | scrapewise_get_scraper_load_history | bearer |
| GET | /api/scraper/load-history/group/{groupId}/job/{jobId}/errors | scrapewise_get_scraper_job_errors | bearer |
| PUT | /api/scraper/data/group/{id}/merge | scrapewise_run_scraper_data_group | bearer + DATA_ENRICHMENT + idempotency-key |
| POST | /api/scraper/data/preview-delete | scrapewise_delete_scraper_data_preview | bearer |
| DELETE | /api/scraper/data | scrapewise_delete_scraper_data | bearer + idempotency-key |
List job history — GET /api/scraper/load-history
curl -H "Authorization: Bearer $KEY" \
"https://portal.scrapewise.ai/api/scraper-api/api/scraper/load-history?page=0&size=20&filters=%7B%22state%22%3A%22FAILED%22%7D"
Paginated history of all scraper runs for the authenticated customer. Default sort: started DESC.
Query params
| Param | Type | Description |
|---|---|---|
| filters | URL-encoded JSON | Common keys: groupId, scraperId, state. Empty = all runs. |
| withDeleted | boolean | Include soft-deleted runs (default false) |
| page / size / sort | Spring Pageable | Standard pagination |
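The filters parameter is easy to get wrong by hand because the JSON has to be serialized first and percent-encoded second. The sketch below shows one way to build the query string with the Python standard library. The helper name build_history_url is hypothetical, and the base URL is copied from the curl example rather than from any client library.

```python
import json
from urllib.parse import urlencode

# Base URL taken from the curl example on this page.
BASE_URL = "https://portal.scrapewise.ai/api/scraper-api"


def build_history_url(filters=None, page=0, size=20, with_deleted=False):
    """Return the load-history URL with `filters` sent as URL-encoded JSON."""
    params = {"page": page, "size": size}
    if filters:
        # Compact separators avoid encoding needless spaces into the query string.
        params["filters"] = json.dumps(filters, separators=(",", ":"))
    if with_deleted:
        params["withDeleted"] = "true"
    return f"{BASE_URL}/api/scraper/load-history?{urlencode(params)}"


failed_runs_url = build_history_url({"state": "FAILED"})
all_runs_url = build_history_url()
print(failed_runs_url)
```

With {"state": "FAILED"} this produces the same query string as the curl example. Omitting filters leaves the parameter out entirely, which the table above describes as "all runs".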
Response (200) — Page<ScraperJobStatusDTO>:
{
"content": [
{
"id": "65a1...",
"scraperId": "scr_abc...",
"groupId": "grp_xyz...",
"state": "COMPLETED",
"started": "2026-05-19T10:00:00Z",
"finished": "2026-05-19T10:05:23Z",
"successLinks": 198,
"errorLinks": 2,
"launchedByCustomer": true
}
],
"totalElements": 42,
"totalPages": 3,
"size": 20,
"number": 0
}
launchedByCustomer distinguishes “I started this run” from “the group’s owning customer started it” (relevant for shared-group runs; see Shared groups).
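As a rough illustration of consuming this response, the sketch below tallies one page of history and checks whether another page remains. The function name summarize_page is hypothetical. The field names come from the sample response above, and the zero-indexed page number is standard Spring Page behaviour.

```python
from collections import Counter


def summarize_page(page):
    """Count jobs per state and total the link counters for one history page."""
    jobs = page.get("content", [])
    return {
        "states": dict(Counter(job["state"] for job in jobs)),
        "successLinks": sum(job.get("successLinks", 0) for job in jobs),
        "errorLinks": sum(job.get("errorLinks", 0) for job in jobs),
        # Pages are zero-indexed, so the last page is totalPages - 1.
        "hasNext": page["number"] + 1 < page["totalPages"],
    }


# Sample data mirrors the response shown above.
sample_page = {
    "content": [
        {
            "id": "65a1...",
            "scraperId": "scr_abc...",
            "groupId": "grp_xyz...",
            "state": "COMPLETED",
            "started": "2026-05-19T10:00:00Z",
            "finished": "2026-05-19T10:05:23Z",
            "successLinks": 198,
            "errorLinks": 2,
            "launchedByCustomer": True,
        }
    ],
    "totalElements": 42,
    "totalPages": 3,
    "size": 20,
    "number": 0,
}

summary = summarize_page(sample_page)
print(summary)
```

A caller would keep requesting page + 1 while hasNext is true.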
Errors — 400 (malformed filters JSON) / 401 / 403 (N/A) / 404 (N/A) / 429 / 500.
Cheaper alternative: for a single scraper’s most-recent state, read lastRunState on the scraper resource directly via GET /api/scraper/{id}.
Get a job’s per-link errors — GET /api/scraper/load-history/group/{groupId}/job/{jobId}/errors
curl -H "Authorization: Bearer $KEY" \
https://portal.scrapewise.ai/api/scraper-api/api/scraper/load-history/group/grp_xyz/job/65a1.../errors
Returns the list of links that failed (HTTP error, parser error, timeout) during the given job. Use it to diagnose why a run produced fewer rows than expected.
The groupId path component is used for cross-tenant ownership checks on shared groups — pass the group id the scraper belongs to.
Response (200) — List<LinkErrorDTO>:
[
{ "url": "https://example.com/p/1", "status": 503, "error": "Upstream temporarily unavailable" },
{ "url": "https://example.com/p/2", "status": null, "error": "Timeout after 30s" }
]
Empty list if all links succeeded.
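When a run has many failed links, grouping them by status usually makes the cause clearer than reading the list top to bottom. The sketch below is one possible way to do that. The function name group_link_errors and the "no_response" bucket label are assumptions made for illustration; only the url, status and error fields come from the sample above.

```python
from collections import defaultdict


def group_link_errors(errors):
    """Bucket failed URLs by HTTP status; a null status goes under 'no_response'."""
    buckets = defaultdict(list)
    for entry in errors:
        status = entry.get("status")
        # A null status means no HTTP response was recorded (e.g. a timeout).
        key = "no_response" if status is None else str(status)
        buckets[key].append(entry["url"])
    return dict(buckets)


# Sample data mirrors the response shown above.
sample_errors = [
    {"url": "https://example.com/p/1", "status": 503, "error": "Upstream temporarily unavailable"},
    {"url": "https://example.com/p/2", "status": None, "error": "Timeout after 30s"},
]

grouped = group_link_errors(sample_errors)
print(grouped)
```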
Errors — 400 (no run with that jobId for this customer/group) / 401 / 403 (N/A) / 404 (N/A — 400 used instead) / 429 / 500.
Merge jobs into an enriched dataset — PUT /api/scraper/data/group/{id}/merge
PUT /api/scraper/data/group/grp_xyz/merge?scraperJobStatusIds=65a1...&scraperJobStatusIds=65a2...&scraperJobStatusIds=65a3...
Authorization: Bearer <key>
Idempotency-Key: <uuid>
Combines the scraped rows from N scraper runs (each from a different scraper) into one enriched dataset, matched on the title field. The title field must be a stable identifier like EAN / barcode (NOT a constant) for the merge to do useful work.
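To see why the title field has to be a stable identifier, consider a toy join keyed on title. This is not the service's merge implementation, which this page does not describe. It is only a minimal sketch of a key-based merge, and the row fields price and stock as well as the EAN values are made up for illustration.

```python
def merge_on_title(*runs):
    """Toy join: fold rows from several runs into one record per title value."""
    merged = {}
    for rows in runs:
        for row in rows:
            # Rows sharing a title are combined; later fields overwrite earlier ones.
            merged.setdefault(row["title"], {}).update(row)
    return list(merged.values())


# Stable identifiers (hypothetical EANs): each product keeps its own record.
prices = [{"title": "4006381333931", "price": 2.49}, {"title": "5012345678900", "price": 7.10}]
stock = [{"title": "4006381333931", "stock": 12}, {"title": "5012345678900", "stock": 0}]
good = merge_on_title(prices, stock)

# Constant title: every row collapses into a single, meaningless record.
bad = merge_on_title(
    [{"title": "Product", "price": 2.49}, {"title": "Product", "price": 7.10}],
    [{"title": "Product", "stock": 12}],
)

print(len(good), len(bad))
```

In the first case each product ends up with both its price and its stock level. In the second, unrelated rows overwrite each other because they share the same key.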
The result lands in the group’s enriched-sibling collection (same reference UUID, group name + _enriched). Requires the DATA_ENRICHMENT plan feature. Only the latest COMPLETED run per scraper is accepted.
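Because only the latest COMPLETED run per scraper is accepted, it can help to pick the ids from the job history before calling merge. The sketch below assumes "latest" means the most recently started run among those in state COMPLETED, which is one reading of the rule above. The function names and the job ids j1 to j4 are hypothetical.

```python
from urllib.parse import urlencode


def latest_completed_ids(jobs):
    """Return the id of the most recently started COMPLETED job for each scraper."""
    latest = {}
    for job in jobs:
        if job["state"] != "COMPLETED":
            continue
        current = latest.get(job["scraperId"])
        # ISO-8601 UTC timestamps in one format compare correctly as strings.
        if current is None or job["started"] > current["started"]:
            latest[job["scraperId"]] = job
    return [job["id"] for job in latest.values()]


def build_merge_path(group_id, job_ids):
    """Build the merge path, repeating scraperJobStatusIds once per job id."""
    query = urlencode({"scraperJobStatusIds": job_ids}, doseq=True)
    return f"/api/scraper/data/group/{group_id}/merge?{query}"


# Hypothetical history entries; only the fields used here are included.
jobs = [
    {"id": "j1", "scraperId": "s1", "state": "COMPLETED", "started": "2026-05-18T10:00:00Z"},
    {"id": "j2", "scraperId": "s1", "state": "COMPLETED", "started": "2026-05-19T10:00:00Z"},
    {"id": "j3", "scraperId": "s1", "state": "FAILED", "started": "2026-05-20T10:00:00Z"},
    {"id": "j4", "scraperId": "s2", "state": "COMPLETED", "started": "2026-05-19T09:00:00Z"},
]

ids = latest_completed_ids(jobs)
merge_path = build_merge_path("grp_xyz", ids)
print(merge_path)
```

The resulting path still needs the Authorization and Idempotency-Key headers shown in the request example.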
Response (200)
{ "mergedItems": 184 }
Errors
| Code | Meaning |
|---|---|
| 400 | Group doesn’t exist for this customer |
| 401 | Missing/invalid bearer |
| 402 | Plan lacks DATA_ENRICHMENT |
| 403 | N/A |
| 404 | N/A (400 used instead) |
| 422 | Zero rows merged — none of the supplied runs is the LATEST COMPLETED for its scraper, or rows have no matching title |
| 429 | Rate-limited |
| 500 | Malformed scraperJobStatusId (no upfront validation — surfaces here) / unexpected error |
Delete a run’s data (destructive — two-call protocol) — DELETE /api/scraper/data
Destructive operation. Deletes every scraped row produced by a single run from the customer’s group collection. The scraper config and the scraperJobStatus record itself are preserved — only the actual product/row data is removed. Irreversible.
ADR-012 two-call pattern: first preview, then commit within 5 minutes.
Steps
1. POST /api/scraper/data/preview-delete?scraperJobStatusId=... mints a 5-minute token and returns a preview summary.
2. DELETE /api/scraper/data?scraperJobStatusId=... commits the delete (idempotency-key required).
Skipping the preview step deletes rows without confirmation.
Step 1 — Preview
POST /api/scraper/data/preview-delete?scraperJobStatusId=65a1...
Authorization: Bearer <key>
Response (200)
{
"token": "7c8d-...",
"opName": "scrapewise_delete_scraper_data",
"targetEntityId": "65a1...",
"previewSummary": {
"entityName": "65a1...",
"entityType": "scraper_run_data",
"cascadeCounts": {},
"warnings": [
"Scraped product rows produced by this run will be permanently deleted (irreversible)."
]
}
}
Render the preview to the user. Only proceed to step 2 on explicit approval.
Step 2 — Commit
DELETE /api/scraper/data?scraperJobStatusId=65a1...
Authorization: Bearer <key>
Idempotency-Key: <uuid>
Response: 204 No Content.
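The two-call flow can be sketched as a small function that refuses to send the DELETE unless the preview was fetched and approved. Everything about the transport is an assumption: send(method, path, headers) is a hypothetical callable that adds the bearer token and returns (status_code, parsed_json_or_None), and confirm(summary) stands in for whatever approval prompt the client shows. The sketch does not pass the preview token to the commit call, because the request example above shows no parameter for it.

```python
import uuid


def delete_run_data(job_id, send, confirm):
    """Preview, ask for approval, then commit. Returns True if the delete was sent."""
    query = f"?scraperJobStatusId={job_id}"

    # Step 1: preview. Nothing is deleted by this call.
    status, preview = send("POST", "/api/scraper/data/preview-delete" + query, {})
    if status != 200:
        raise RuntimeError(f"preview failed with HTTP {status}")
    if not confirm(preview["previewSummary"]):
        return False  # user declined, so the DELETE is never sent

    # Step 2: commit, with a fresh idempotency key for this delete.
    headers = {"Idempotency-Key": str(uuid.uuid4())}
    status, _ = send("DELETE", "/api/scraper/data" + query, headers)
    if status != 204:
        raise RuntimeError(f"delete failed with HTTP {status}")
    return True


# Offline demonstration with a fake transport that records each call.
calls = []


def fake_send(method, path, headers):
    calls.append((method, path, headers))
    if method == "POST":
        return 200, {"previewSummary": {"warnings": ["rows will be permanently deleted"]}}
    return 204, None


declined = delete_run_data("JOB_ID", fake_send, confirm=lambda summary: False)
approved = delete_run_data("JOB_ID", fake_send, confirm=lambda summary: True)
print(declined, approved, [call[0] for call in calls])
```

Passing the transport and the approval step in as callables keeps the ordering logic testable without touching real data.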
Errors (both steps)
| Code | Meaning |
|---|---|
| 400 | scraperJobStatusId is not a valid ObjectId / no run with that id for this customer |
| 401 | Missing/invalid bearer |
| 403 | N/A |
| 404 | N/A (400 used instead) |
| 429 | Rate-limited |
| 500 | Persistence failure mid-delete |
See also
- Scrapers: scraperJobStatus rows are produced by GET /api/scraper/{id}/run
- Scraped data: read the merged enriched dataset
- Groups: withData=true on group delete removes the whole collection
- Plan features: which tiers unlock DATA_ENRICHMENT