Scraper jobs

A job (technically scraperJobStatus) is one execution attempt of a scraper. Each job carries its state (PENDING, RUNNING, COMPLETED, FAILED, STOPPED), timestamps, link counts, and any per-link errors. This page covers:

  • Reading the paginated job history (filterable by group / scraper / state / date).
  • Drilling into a specific job’s failed-link list.
  • Merging multiple jobs’ results into an enriched dataset (DATA_ENRICHMENT feature).
  • Destructive: deleting the scraped rows produced by a single job.

Endpoint summary

| Method | Path | Operation ID | Auth scope |
| --- | --- | --- | --- |
| GET | /api/scraper/load-history | scrapewise_get_scraper_load_history | bearer |
| GET | /api/scraper/load-history/group/{groupId}/job/{jobId}/errors | scrapewise_get_scraper_job_errors | bearer |
| PUT | /api/scraper/data/group/{id}/merge | scrapewise_run_scraper_data_group | bearer + DATA_ENRICHMENT + idempotency-key |
| POST | /api/scraper/data/preview-delete | scrapewise_delete_scraper_data_preview | bearer |
| DELETE | /api/scraper/data | scrapewise_delete_scraper_data | bearer + idempotency-key |

List job history — GET /api/scraper/load-history

curl -H "Authorization: Bearer $KEY" \
  "https://portal.scrapewise.ai/api/scraper-api/api/scraper/load-history?page=0&size=20&filters=%7B%22state%22%3A%22FAILED%22%7D"

Paginated history of all scraper runs for the authenticated customer. Default sort: started DESC.

Query params

| Param | Type | Description |
| --- | --- | --- |
| filters | URL-encoded JSON | Common keys: groupId, scraperId, state. Empty = all runs. |
| withDeleted | boolean | Include soft-deleted runs (default false) |
| page / size / sort | Spring Pageable | Standard pagination |
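The filters parameter is a JSON object that has to be URL-encoded before it goes into the query string. A minimal Python sketch of building that URL follows; the helper name load_history_url is hypothetical, and the base URL is copied from the curl example above.

```python
import json
from urllib.parse import urlencode

# Base URL copied from the curl example on this page.
BASE_URL = "https://portal.scrapewise.ai/api/scraper-api"


def load_history_url(filters=None, page=0, size=20, with_deleted=False):
    """Build the GET /api/scraper/load-history URL (hypothetical helper)."""
    params = {"page": page, "size": size}
    if filters:
        # Compact separators keep the encoded JSON free of stray spaces.
        params["filters"] = json.dumps(filters, separators=(",", ":"))
    if with_deleted:
        params["withDeleted"] = "true"
    # urlencode percent-encodes the braces, quotes and colon of the JSON.
    return f"{BASE_URL}/api/scraper/load-history?{urlencode(params)}"


url = load_history_url({"state": "FAILED"})
print(url)
```

With the filter {"state": "FAILED"} this reproduces the query string used in the curl example.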

Response (200): Page<ScraperJobStatusDTO>

{
  "content": [
    {
      "id": "65a1...",
      "scraperId": "scr_abc...",
      "groupId": "grp_xyz...",
      "state": "COMPLETED",
      "started": "2026-05-19T10:00:00Z",
      "finished": "2026-05-19T10:05:23Z",
      "successLinks": 198,
      "errorLinks": 2,
      "launchedByCustomer": true
    }
  ],
  "totalElements": 42,
  "totalPages": 3,
  "size": 20,
  "number": 0
}

launchedByCustomer distinguishes “I started this run” from “the group’s owning customer started it” (relevant for shared-group runs — see Shared groups).
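Because the response is a page, reading the full history means requesting page 0, 1, 2 and so on until totalPages is reached. The sketch below shows that loop in Python. fetch_page is a hypothetical callable standing in for the HTTP request, and fake_fetch returns made-up pages shaped like the response above so the loop can run without network access.

```python
from typing import Callable, Dict, Iterator


def iter_jobs(fetch_page: Callable[[int], Dict]) -> Iterator[Dict]:
    """Yield every job across all pages of a Page-shaped response.

    fetch_page(n) is a hypothetical callable that requests page=n and
    returns the decoded JSON body.
    """
    page = 0
    while True:
        body = fetch_page(page)
        yield from body["content"]
        page += 1
        if page >= body["totalPages"]:
            break


# Stand-in for the HTTP call: two fabricated pages, illustration only.
def fake_fetch(n):
    pages = [
        [{"id": "job-1", "state": "COMPLETED", "errorLinks": 2}],
        [{"id": "job-2", "state": "FAILED", "errorLinks": 7}],
    ]
    return {"content": pages[n], "totalPages": 2, "number": n}


jobs = list(iter_jobs(fake_fetch))
total_errors = sum(job["errorLinks"] for job in jobs)
print(len(jobs), total_errors)
```

Injecting the fetch function keeps the paging logic independent of whichever HTTP client is used.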

Errors — 400 (malformed filters JSON) / 401 / 403 (N/A) / 404 (N/A) / 429 / 500.

Cheaper alternative: for a single scraper’s most-recent state, read lastRunState on the scraper resource directly via GET /api/scraper/{id}.

List a job’s failed links: GET /api/scraper/load-history/group/{groupId}/job/{jobId}/errors

curl -H "Authorization: Bearer $KEY" \
  https://portal.scrapewise.ai/api/scraper-api/api/scraper/load-history/group/grp_xyz/job/65a1.../errors

Returns the list of links that failed (HTTP error, parser error, timeout) during the given job. Use to diagnose why a run produced fewer rows than expected.

The groupId path component is used for cross-tenant ownership checks on shared groups — pass the group id the scraper belongs to.

Response (200): List<LinkErrorDTO>

[
  {
    "url": "https://example.com/p/1",
    "status": 503,
    "error": "Upstream temporarily unavailable"
  },
  {
    "url": "https://example.com/p/2",
    "status": null,
    "error": "Timeout after 30s"
  }
]

Empty list if all links succeeded.
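When a run has many failed links, grouping them by status shows at a glance whether the target site or the network is the likely cause. A minimal Python sketch, using the sample response above as input; the helper name and the "no-response" label for a null status are assumptions made for illustration.

```python
from collections import Counter


def summarize_link_errors(errors):
    """Count failed links per HTTP status.

    A null status (as in the timeout entry of the sample) is reported
    under the assumed label "no-response".
    """
    counts = Counter()
    for item in errors:
        key = item["status"] if item["status"] is not None else "no-response"
        counts[key] += 1
    return dict(counts)


sample = [
    {"url": "https://example.com/p/1", "status": 503,
     "error": "Upstream temporarily unavailable"},
    {"url": "https://example.com/p/2", "status": None,
     "error": "Timeout after 30s"},
]
summary = summarize_link_errors(sample)
print(summary)
```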

Errors — 400 (no run with that jobId for this customer/group) / 401 / 403 (N/A) / 404 (N/A — 400 used instead) / 429 / 500.

Merge jobs into an enriched dataset — PUT /api/scraper/data/group/{id}/merge

PUT /api/scraper/data/group/grp_xyz/merge?scraperJobStatusIds=65a1...&scraperJobStatusIds=65a2...&scraperJobStatusIds=65a3...
Authorization: Bearer <key>
Idempotency-Key: <uuid>

Combines the scraped rows from N scraper runs (each from a different scraper) into one enriched dataset, matched on the title field. The title field must be a stable identifier like EAN / barcode (NOT a constant) for the merge to do useful work.

The result lands in the group’s enriched-sibling collection (same reference UUID, group name + _enriched). Requires the DATA_ENRICHMENT plan feature. Only the latest COMPLETED run per scraper is accepted.
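The merge request repeats the scraperJobStatusIds query parameter once per run and carries an Idempotency-Key header. A hedged Python sketch of assembling it: build_merge_request is a hypothetical helper, the base URL is assumed to be the same prefix the curl examples use, and generating a fresh UUID per merge is an assumption about how the key is meant to be used (this page only states that the header is required).

```python
import uuid
from urllib.parse import urlencode

# Assumed: same prefix as the curl examples elsewhere on this page.
BASE_URL = "https://portal.scrapewise.ai/api/scraper-api"


def build_merge_request(group_id, job_ids, api_key):
    """Return (method, url, headers) for the merge call (hypothetical helper)."""
    # doseq=True emits one scraperJobStatusIds=... pair per job id.
    query = urlencode({"scraperJobStatusIds": job_ids}, doseq=True)
    url = f"{BASE_URL}/api/scraper/data/group/{group_id}/merge?{query}"
    headers = {
        "Authorization": f"Bearer {api_key}",
        # Assumed usage: one fresh UUID per logical merge operation.
        "Idempotency-Key": str(uuid.uuid4()),
    }
    return "PUT", url, headers


method, url, headers = build_merge_request("grp_xyz", ["job-a", "job-b"], "KEY")
print(method, url)
```

The job ids in the usage line are placeholders; real calls pass the ids of the latest COMPLETED run of each scraper.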

Response (200)

{ "mergedItems": 184 }

Errors

| Code | Meaning |
| --- | --- |
| 400 | Group doesn’t exist for this customer |
| 401 | Missing/invalid bearer |
| 402 | Plan lacks DATA_ENRICHMENT |
| 403 | N/A |
| 404 | N/A (400 used instead) |
| 422 | Zero rows merged: none of the supplied runs is the LATEST COMPLETED for its scraper, or rows have no matching title |
| 429 | Rate-limited |
| 500 | Malformed scraperJobStatusId (no upfront validation, so it surfaces here) / unexpected error |
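One way a client might act on these codes is sketched below in Python. The function and the action labels are hypothetical. The one non-obvious branch follows from the table: a 500 can mean a malformed scraperJobStatusId, so checking the ids before retrying is likely more useful than retrying blindly.

```python
def merge_next_step(status):
    """Map a merge response status to a suggested action (labels are illustrative)."""
    if status == 200:
        return "done"
    if status == 429:
        return "retry-later"            # rate-limited
    if status == 422:
        return "check-job-ids"          # not the latest COMPLETED runs, or no matching titles
    if status == 402:
        return "upgrade-plan"           # plan lacks DATA_ENRICHMENT
    if status == 500:
        return "check-job-ids-then-report"  # may be a malformed scraperJobStatusId
    return "fix-request"                # 400 / 401: wrong group or credentials


for code in (200, 422, 500):
    print(code, merge_next_step(code))
```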

Delete a run’s data (destructive — two-call protocol) — DELETE /api/scraper/data

Destructive operation. Deletes every scraped row produced by a single run from the customer’s group collection. The scraper config and the scraperJobStatus record itself are preserved — only the actual product/row data is removed. Irreversible.

ADR-012 two-call pattern: first preview, then commit within 5 minutes.

Steps

  1. POST /api/scraper/data/preview-delete?scraperJobStatusId=... — mints a 5-minute token + preview summary.
  2. DELETE /api/scraper/data?scraperJobStatusId=... — commits the delete (idempotency-key required).

Skipping the preview step deletes rows without confirmation.

Step 1 — Preview

POST /api/scraper/data/preview-delete?scraperJobStatusId=65a1...
Authorization: Bearer <key>

Response (200)

{
  "token": "7c8d-...",
  "opName": "scrapewise_delete_scraper_data",
  "targetEntityId": "65a1...",
  "previewSummary": {
    "entityName": "65a1...",
    "entityType": "scraper_run_data",
    "cascadeCounts": {},
    "warnings": [
      "Scraped product rows produced by this run will be permanently deleted (irreversible)."
    ]
  }
}

Render the preview to the user. Only proceed to step 2 on explicit approval.

Step 2 — Commit

DELETE /api/scraper/data?scraperJobStatusId=65a1...
Authorization: Bearer <key>
Idempotency-Key: <uuid>

Response: 204 No Content.
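The two steps can be wrapped in one client-side function that never issues the DELETE without an approved preview. The Python sketch below is illustrative only. send and confirm are hypothetical callables (the transport is assumed to add the Authorization header), and because this page does not show whether or how the preview token is passed to step 2, the sketch only tracks the five-minute window locally.

```python
import time
import uuid

TOKEN_TTL_SECONDS = 5 * 60  # the preview is valid for 5 minutes


def delete_run_data(job_id, send, confirm, now=time.monotonic):
    """Preview, ask for approval, then commit. Returns True if the delete ran.

    send(method, path, headers) -> (status_code, json_body_or_None) and
    confirm(preview_summary) -> bool are hypothetical callables.
    """
    query = f"?scraperJobStatusId={job_id}"
    status, preview = send("POST", "/api/scraper/data/preview-delete" + query, {})
    if status != 200:
        raise RuntimeError(f"preview failed with HTTP {status}")
    minted_at = now()

    if not confirm(preview["previewSummary"]):
        return False  # declined: no DELETE is sent
    if now() - minted_at > TOKEN_TTL_SECONDS:
        raise TimeoutError("preview expired; request a new preview")

    headers = {"Idempotency-Key": str(uuid.uuid4())}
    status, _ = send("DELETE", "/api/scraper/data" + query, headers)
    if status != 204:
        raise RuntimeError(f"delete failed with HTTP {status}")
    return True


# Fake transport that records calls, so the flow runs without a network.
calls = []


def fake_send(method, path, headers):
    calls.append(method)
    if method == "POST":
        return 200, {"previewSummary": {"warnings": ["irreversible"]}}
    return 204, None


declined = delete_run_data("job-1", fake_send, confirm=lambda summary: False)
approved = delete_run_data("job-1", fake_send, confirm=lambda summary: True)
print(declined, approved, calls)
```

In a real client, confirm would render previewSummary.warnings to the user and return their explicit answer.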

Errors (both steps)

| Code | Meaning |
| --- | --- |
| 400 | scraperJobStatusId is not a valid ObjectId / no run with that id for this customer |
| 401 | Missing/invalid bearer |
| 403 | N/A |
| 404 | N/A (400 used instead) |
| 429 | Rate-limited |
| 500 | Persistence failure mid-delete |

See also

  • Scrapers: scraperJobStatus rows are produced by GET /api/scraper/{id}/run
  • Scraped data: read the merged enriched dataset
  • Groups: withData=true on group delete removes the whole collection
  • Plan features: which tiers unlock DATA_ENRICHMENT