Scraper jobs

A job (technically scraperJobStatus) is one execution attempt of a scraper. Each job carries its state (PENDING, RUNNING, COMPLETED, FAILED, STOPPED), timestamps, link counts, and any per-link errors. This page covers:

Reading the paginated job history (filterable by group / scraper / state / date).
Drilling into a specific job’s failed-link list.
Merging multiple jobs’ results into an enriched dataset (DATA_ENRICHMENT feature).
Deleting the scraped rows produced by a single job (destructive).

Endpoint summary

Method	Path	Operation ID	Auth scope
GET	`/api/scraper/load-history`	`scrapewise_get_scraper_load_history`	bearer
GET	`/api/scraper/load-history/group/{groupId}/job/{jobId}/errors`	`scrapewise_get_scraper_job_errors`	bearer
PUT	`/api/scraper/data/group/{id}/merge`	`scrapewise_run_scraper_data_group`	bearer + `DATA_ENRICHMENT` + idempotency-key
POST	`/api/scraper/data/preview-delete`	`scrapewise_delete_scraper_data_preview`	bearer
DELETE	`/api/scraper/data`	`scrapewise_delete_scraper_data`	bearer + idempotency-key

List job history — `GET /api/scraper/load-history`


curl -H "Authorization: Bearer $KEY" \
  "https://portal.scrapewise.ai/api/scraper-api/api/scraper/load-history?page=0&size=20&filters=%7B%22state%22%3A%22FAILED%22%7D"

Paginated history of all scraper runs for the authenticated customer. Default sort: started DESC.

Query params

Param	Type	Description
`filters`	URL-encoded JSON	Common keys: `groupId`, `scraperId`, `state`. Empty = all runs.
`withDeleted`	boolean	Include soft-deleted runs (default `false`)
`page` / `size` / `sort`	Spring `Pageable`	Standard pagination
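
The `filters` value is JSON that has to be percent-encoded before it goes into the query string. A minimal Python sketch of building the request URL follows. It assumes the base URL shown in the curl example above, and `build_history_url` is a hypothetical helper for illustration, not part of any Scrapewise client.

```python
import json
from urllib.parse import urlencode

# Base URL copied from the curl example; adjust if your deployment differs.
BASE_URL = "https://portal.scrapewise.ai/api/scraper-api"


def build_history_url(filters=None, page=0, size=20, with_deleted=False):
    """Build the load-history URL (hypothetical helper for illustration)."""
    params = {"page": page, "size": size}
    if filters:
        # Compact JSON keeps the encoded value short; urlencode percent-encodes it.
        params["filters"] = json.dumps(filters, separators=(",", ":"))
    if with_deleted:
        params["withDeleted"] = "true"
    return f"{BASE_URL}/api/scraper/load-history?{urlencode(params)}"


print(build_history_url({"state": "FAILED"}))
```

With `{"state": "FAILED"}` this yields the same query string as the curl example. Omitting `filters` leaves the parameter out entirely, which the table above describes as returning all runs.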

Response (200) — Page<ScraperJobStatusDTO>:


{
  "content": [
    {
      "id": "65a1...",
      "scraperId": "scr_abc...",
      "groupId": "grp_xyz...",
      "state": "COMPLETED",
      "started":  "2026-05-19T10:00:00Z",
      "finished": "2026-05-19T10:05:23Z",
      "successLinks": 198,
      "errorLinks": 2,
      "launchedByCustomer": true
    }
  ],
  "totalElements": 42,
  "totalPages": 3,
  "size": 20,
  "number": 0
}

launchedByCustomer distinguishes “I started this run” from “the group’s owning customer started it” (relevant for shared-group runs — see Shared groups).
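
The paged envelope above can be walked with a small generator. This is a sketch only: `fetch_page` is a hypothetical caller-supplied function that performs `GET /api/scraper/load-history` for a zero-based page number and returns the decoded JSON body.

```python
from typing import Callable, Iterator


def iter_jobs(fetch_page: Callable[[int], dict]) -> Iterator[dict]:
    """Yield every job across all pages of the load-history response.

    fetch_page is a hypothetical callable supplied by the caller; it is
    expected to return the decoded Page<ScraperJobStatusDTO> body.
    """
    page = 0
    while True:
        body = fetch_page(page)
        yield from body["content"]
        page += 1
        # totalPages comes from the response envelope; stop after the last page.
        if page >= body["totalPages"]:
            break
```

A caller could then keep only the runs flagged with `launchedByCustomer`, for example `[j for j in iter_jobs(fetch_page) if j["launchedByCustomer"]]`.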

Errors — 400 (malformed filters JSON) / 401 / 403 (N/A) / 404 (N/A) / 429 / 500.

Cheaper alternative: for a single scraper’s most-recent state, read lastRunState on the scraper resource directly via GET /api/scraper/{id}.

Get a job’s per-link errors — `GET /api/scraper/load-history/group/{groupId}/job/{jobId}/errors`


curl -H "Authorization: Bearer $KEY" \
  https://portal.scrapewise.ai/api/scraper-api/api/scraper/load-history/group/grp_xyz/job/65a1.../errors

Returns the list of links that failed (HTTP error, parser error, timeout) during the given job. Use it to diagnose why a run produced fewer rows than expected.

The groupId path component is used for cross-tenant ownership checks on shared groups — pass the group id the scraper belongs to.

Response (200) — List<LinkErrorDTO>:


[
  { "url": "https://example.com/p/1", "status": 503, "error": "Upstream temporarily unavailable" },
  { "url": "https://example.com/p/2", "status": null, "error": "Timeout after 30s" }
]

Returns an empty list if all links succeeded.
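
When a job has many failed links, grouping them by `status` gives a quick picture of what went wrong. A minimal sketch using the sample response above; `summarize_errors` is a hypothetical helper, and treating a `null` status as "no HTTP status" is an assumption based on the timeout entry in the sample.

```python
from collections import Counter


def summarize_errors(errors):
    """Count failed links per HTTP status; None marks failures without a status."""
    return Counter(e.get("status") for e in errors)


# Sample data copied from the response example above.
errors = [
    {"url": "https://example.com/p/1", "status": 503, "error": "Upstream temporarily unavailable"},
    {"url": "https://example.com/p/2", "status": None, "error": "Timeout after 30s"},
]

for status, count in summarize_errors(errors).most_common():
    label = status if status is not None else "no HTTP status"
    print(f"{label}: {count}")
```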

Errors — 400 (no run with that jobId for this customer/group) / 401 / 403 (N/A) / 404 (N/A — 400 used instead) / 429 / 500.

Merge jobs into an enriched dataset — `PUT /api/scraper/data/group/{id}/merge`


PUT /api/scraper/data/group/grp_xyz/merge?scraperJobStatusIds=65a1...&scraperJobStatusIds=65a2...&scraperJobStatusIds=65a3...
Authorization: Bearer <key>
Idempotency-Key: <uuid>

Combines the scraped rows from N scraper runs (each from a different scraper) into one enriched dataset, matched on the title field. The title field must be a stable identifier like EAN / barcode (NOT a constant) for the merge to do useful work.

The result lands in the group’s enriched-sibling collection (same reference UUID, group name + _enriched). Requires the DATA_ENRICHMENT plan feature. Only the latest COMPLETED run per scraper is accepted.
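
Because only the latest COMPLETED run per scraper is accepted, it can help to pick the ids from the job history before calling merge. The sketch below reads "latest" as the most recently started job among those in state COMPLETED, which is an assumption; both helper names are hypothetical.

```python
from urllib.parse import urlencode


def latest_completed_ids(jobs):
    """Return the id of the most recently started COMPLETED job per scraper."""
    latest = {}
    for job in jobs:
        if job["state"] != "COMPLETED":
            continue
        current = latest.get(job["scraperId"])
        # ISO-8601 UTC timestamps in one fixed format sort correctly as strings.
        if current is None or job["started"] > current["started"]:
            latest[job["scraperId"]] = job
    return [job["id"] for job in latest.values()]


def build_merge_path(group_id, job_ids):
    """Build the merge path with one scraperJobStatusIds parameter per job."""
    query = urlencode([("scraperJobStatusIds", job_id) for job_id in job_ids])
    return f"/api/scraper/data/group/{group_id}/merge?{query}"
```

The `jobs` argument is expected to be the `content` entries from the load-history response. The resulting path is sent as a `PUT` with the bearer and `Idempotency-Key` headers shown in the request example above.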

Response (200)


{ "mergedItems": 184 }

Errors

Code	Meaning
400	Group doesn’t exist for this customer
401	Missing/invalid bearer
402	Plan lacks `DATA_ENRICHMENT`
403	N/A
404	N/A (400 used instead)
422	Zero rows merged — none of the supplied runs is the LATEST `COMPLETED` for its scraper, or rows have no matching `title`
429	Rate-limited
500	Malformed `scraperJobStatusId` (no upfront validation — surfaces here) / unexpected error

Delete a run’s data (destructive — two-call protocol) — `DELETE /api/scraper/data`

Destructive operation. Deletes every scraped row produced by a single run from the customer’s group collection. The scraper config and the scraperJobStatus record itself are preserved — only the actual product/row data is removed. Irreversible.

ADR-012 two-call pattern: first preview, then commit within 5 minutes.

Steps

POST /api/scraper/data/preview-delete?scraperJobStatusId=... — mints a 5-minute token + preview summary.
DELETE /api/scraper/data?scraperJobStatusId=... — commits the delete (idempotency-key required).

Do not skip the preview step: calling DELETE directly removes the rows without any confirmation.

Step 1 — Preview


POST /api/scraper/data/preview-delete?scraperJobStatusId=65a1...
Authorization: Bearer <key>

Response (200)


{
  "token": "7c8d-...",
  "opName": "scrapewise_delete_scraper_data",
  "targetEntityId": "65a1...",
  "previewSummary": {
    "entityName": "65a1...",
    "entityType": "scraper_run_data",
    "cascadeCounts": {},
    "warnings": [
      "Scraped product rows produced by this run will be permanently deleted (irreversible)."
    ]
  }
}

Render the preview to the user. Only proceed to step 2 on explicit approval.
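
One way to render the preview is as a few lines of plain text for a confirmation prompt. This is a sketch only: `render_preview` is a hypothetical helper, and reading `cascadeCounts` as a map from a name to a count is an assumption, since the sample above shows it empty.

```python
def render_preview(preview):
    """Format a preview-delete response as plain text for a confirmation prompt."""
    summary = preview["previewSummary"]
    lines = [
        f"Operation: {preview['opName']}",
        f"Target: {summary['entityType']} {summary['entityName']}",
    ]
    # Assumed shape: {"<name>": <count>}; the sample response has no entries.
    for name, count in summary["cascadeCounts"].items():
        lines.append(f"Also affects {count} {name}")
    for warning in summary["warnings"]:
        lines.append(f"WARNING: {warning}")
    return "\n".join(lines)
```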

Step 2 — Commit


DELETE /api/scraper/data?scraperJobStatusId=65a1...
Authorization: Bearer <key>
Idempotency-Key: <uuid>

Response — 204 No Content.
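
The two calls can be tied together in one client-side function. The sketch below makes several assumptions: `send` and `confirm` are hypothetical caller-supplied callables, the 5-minute window is checked on the client, and the preview token is not sent on the DELETE because the request example above does not show it.

```python
import time
import uuid

PREVIEW_TTL_SECONDS = 5 * 60  # the preview token is described as valid for 5 minutes


def delete_run_data(job_id, send, confirm, clock=time.monotonic):
    """Preview, ask for approval, then commit. Returns True if the delete was sent.

    send(method, path, params, headers) and confirm(preview) are hypothetical
    callables: send performs the HTTP request (adding the bearer header) and
    returns the decoded body, or None for 204; confirm returns True only on
    explicit user approval.
    """
    params = {"scraperJobStatusId": job_id}
    started = clock()
    preview = send("POST", "/api/scraper/data/preview-delete", params, {})
    if not confirm(preview):
        return False
    if clock() - started > PREVIEW_TTL_SECONDS:
        raise TimeoutError("preview is older than 5 minutes; request a new one")
    # A fresh key per logical delete; reuse the same key when retrying this call.
    headers = {"Idempotency-Key": str(uuid.uuid4())}
    send("DELETE", "/api/scraper/data", params, headers)
    return True
```

Injecting `send` keeps the protocol logic separate from the HTTP library, so the approval and timing rules can be exercised without touching real data.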

Errors (both steps)

Code	Meaning
400	`scraperJobStatusId` is not a valid ObjectId / no run with that id for this customer
401	Missing/invalid bearer
403	N/A
404	N/A (400 used instead)
429	Rate-limited
500	Persistence failure mid-delete
