Sites

A Site is the URL inventory for a scraper that runs against a fixed set of pages (as opposed to a paginated category listing). Use Sites when you want a scraper to iterate “these 50 specific URLs” rather than crawl a category page.

Mutating endpoints are gated by the SITE_MAP plan feature. Read endpoints work for any authenticated customer who owns the scraper.

Endpoint summary

| Method | Path | Operation ID | Auth scope |
| --- | --- | --- | --- |
| PUT | /api/scraper/site | scrapewise_create_scraper_site | bearer + SITE_MAP + idempotency-key |
| GET | /api/scraper/{id}/site | scrapewise_get_scraper_site | bearer |
| GET | /api/scraper/site/{siteId}/links | scrapewise_get_scraper_site_links | bearer |
| DELETE | /api/scraper/site/{siteId} | scrapewise_delete_scraper_site | bearer + idempotency-key |

Create or update a Site — PUT /api/scraper/site

    PUT /api/scraper/site
    Authorization: Bearer <key>
    Idempotency-Key: <uuid>
    Content-Type: application/json

    {
      "id": null,
      "name": "competitor-x product pages",
      "scraperRef": "scr_abc123",
      "links": [
        { "url": "https://competitor-x.com/product/1", "title": "Product 1" },
        { "url": "https://competitor-x.com/product/2", "title": "Product 2" }
      ]
    }

A null or empty id creates a new Site; a non-empty id updates the existing Site with that id. The initial Link list can be passed inline via links[].
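
As a rough illustration of the request above, the following Python sketch builds (but does not send) the PUT using only the standard library. The base URL is taken from the curl examples on this page; the helper name `build_site_put` and the placeholder key are hypothetical, and generating one UUID per logical operation is an assumption about how the Idempotency-Key is meant to be used.

```python
import json
import uuid
import urllib.request

# Base URL as it appears in the curl examples on this page.
BASE_URL = "https://portal.scrapewise.ai/api/scraper-api"


def build_site_put(api_key, name, scraper_ref, links, site_id=None):
    """Build (but do not send) a PUT /api/scraper/site request.

    site_id=None asks for a new Site; a non-empty site_id asks for an
    update of that Site. (Hypothetical helper, not part of any SDK.)
    """
    body = {
        "id": site_id,
        "name": name,
        "scraperRef": scraper_ref,
        "links": links,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/api/scraper/site",
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={
            "Authorization": f"Bearer {api_key}",
            # Assumption: one fresh UUID per logical operation.
            "Idempotency-Key": str(uuid.uuid4()),
            "Content-Type": "application/json",
        },
    )


request = build_site_put(
    api_key="YOUR_KEY",  # placeholder
    name="competitor-x product pages",
    scraper_ref="scr_abc123",
    links=[{"url": "https://competitor-x.com/product/1", "title": "Product 1"}],
)
```

Passing `request` to `urllib.request.urlopen` would send it; that step is left out so the sketch stays free of network calls.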

Response (200) — persisted SiteDTO with the assigned id.

Errors — 400 (validation) / 401 / 402 (plan lacks SITE_MAP) / 403 (N/A) / 404 (N/A) / 429 / 500.

Adding more links to an existing Site: today there is no incremental add-link endpoint — re-PUT the full SiteDTO with the union of old + new links.
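
The union step can be sketched client-side as below. This assumes links are plain dicts with `url` and `title` keys, as in the request example, and that the existing links have already been fetched (for instance from the links endpoint). The helper names are hypothetical, and letting the newer entry win on a duplicate URL is a design choice of this sketch, not documented server behaviour.

```python
def merge_links(existing, new):
    """Return the union of two link lists, keyed by URL.

    Order of first appearance is preserved; when a URL appears in both
    lists, the entry from `new` replaces the older one.
    """
    merged = {link["url"]: link for link in existing}
    for link in new:
        merged[link["url"]] = link
    return list(merged.values())


def with_added_links(site, new_links):
    """Return a copy of a SiteDTO-shaped dict carrying old + new links."""
    updated = dict(site)
    updated["links"] = merge_links(site.get("links") or [], new_links)
    return updated
```

The dict returned by `with_added_links` keeps the Site's `id`, so re-PUTting it is treated as an update rather than a create.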

Get a scraper’s Site — GET /api/scraper/{id}/site

    curl -H "Authorization: Bearer $KEY" \
      https://portal.scrapewise.ai/api/scraper-api/api/scraper/scr_abc123/site

Returns the Site attached to a given scraper (note: the path id is the scraper’s id, NOT the site’s id — the lookup is “what site does this scraper use?”).

Response (200) — SiteDTO or null. Not every scraper uses a Site (single-page scrapers and paginated category-listing scrapers don’t). For the actual link inventory, follow up with /api/scraper/site/{siteId}/links using the site id from this response.
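
A small sketch of the follow-up step, handling the null case. It assumes the null response arrives as the JSON literal `null` (an empty body is treated the same way, as a precaution) and that the SiteDTO exposes its id under the key `id`, as in the PUT example. The function name is hypothetical.

```python
import json

# Base URL as it appears in the curl examples on this page.
BASE_URL = "https://portal.scrapewise.ai/api/scraper-api"


def links_url_from_response(body):
    """Map a GET /api/scraper/{id}/site response body to the links URL.

    Returns None when the scraper has no Site attached.
    """
    if not body.strip():
        return None  # assumption: an empty body also means "no Site"
    site = json.loads(body)  # JSON null becomes Python None
    if site is None:
        return None
    return f"{BASE_URL}/api/scraper/site/{site['id']}/links"
```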

Errors — 400 (N/A) / 401 / 403 (N/A) / 404 (N/A) / 429 / 500.

List a Site’s links — GET /api/scraper/site/{siteId}/links

    curl -H "Authorization: Bearer $KEY" \
      "https://portal.scrapewise.ai/api/scraper-api/api/scraper/site/5f9a.../links?page=0&size=100&sortField=url&sortDirection=asc"

Paginated URL inventory. Each Link carries url, title, and per-link state (visited, errored, pending).

Query params

| Param | Default | Description |
| --- | --- | --- |
| page | 0 | Zero-indexed page number |
| size | 100 | Page size, capped at 1000 |
| sortField | url | One of url, title, curl (whitelisted) |
| sortDirection | asc | asc or desc (case-insensitive) |
| filters | | URL-encoded MongoDB-style filter expression, e.g. {"state":"ERRORED"} |

Out-of-range page and size values are silently clamped rather than rejected with a 500. Unknown sortField values fall back to url, and the server logs a warning.
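
The query string can be assembled as in the sketch below, which mirrors the documented defaults and clamping on the client so that the request matches what the server will actually use. The lower bound of 1 for size is an assumption (only the cap of 1000 is documented), and the function name is hypothetical.

```python
import json
from urllib.parse import urlencode

SORT_FIELDS = {"url", "title", "curl"}  # whitelist from the table above
MAX_SIZE = 1000                         # documented cap


def links_query(page=0, size=100, sort_field="url",
                sort_direction="asc", filters=None):
    """Return the URL-encoded query string for the links endpoint."""
    params = {
        "page": max(0, page),
        # Assumption: 1 is the smallest useful page size.
        "size": min(max(1, size), MAX_SIZE),
        "sortField": sort_field if sort_field in SORT_FIELDS else "url",
        # The server is case-insensitive; lowercasing is only cosmetic.
        "sortDirection": sort_direction.lower(),
    }
    if filters is not None:
        # Compact JSON; urlencode takes care of percent-encoding it.
        params["filters"] = json.dumps(filters, separators=(",", ":"))
    return urlencode(params)
```

For example, `links_query(sort_direction="DESC", filters={"title": "Product 1"})` produces a string that can be appended after `?` on the links URL.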

Filter fields are whitelisted: url, title, curl. Logical operators $and, $or, $nor are supported and recursively sanitised.
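
A client can run a similar recursive check before sending a filter, as in this sketch. It only mirrors the whitelist and logical operators named above; the page does not say whether the server drops or rejects other fields, so the function just reports them. The function name and the `price` field in the usage note are hypothetical.

```python
ALLOWED_FIELDS = {"url", "title", "curl"}
LOGICAL_OPERATORS = {"$and", "$or", "$nor"}


def disallowed_fields(expr):
    """Return the keys in a filter expression that are neither
    whitelisted fields nor supported logical operators."""
    bad = set()
    for key, value in expr.items():
        if key in LOGICAL_OPERATORS:
            # Logical operators carry a list of sub-expressions.
            for sub in value:
                bad |= disallowed_fields(sub)
        elif key not in ALLOWED_FIELDS:
            bad.add(key)
    return bad
```

Calling it on `{"$or": [{"url": "a"}, {"$and": [{"title": "t"}, {"price": 1}]}]}` reports `price` as the only field outside the whitelist.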

Response (200) — Spring Page<LinkDTO> ({ content, totalElements, totalPages, ... }).
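
Walking the whole inventory then comes down to requesting pages until totalPages is reached. The sketch below takes the HTTP call as an injected function so it stays transport-agnostic; `fetch_page` is a hypothetical callable that returns the parsed Page JSON for a given page and size.

```python
def iter_links(fetch_page, size=100):
    """Yield every link by walking pages until totalPages is reached.

    fetch_page(page, size) must return a dict with at least the
    `content` and `totalPages` keys of the Page response.
    """
    page = 0
    while True:
        result = fetch_page(page, size)
        yield from result["content"]
        page += 1
        if page >= result["totalPages"]:
            break
```

Because it is a generator, callers can stop early (for example after the first errored link) without fetching the remaining pages.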

Errors — 400 (invalid siteId ObjectId / malformed filters JSON / site not found for customer) / 401 / 403 (N/A) / 404 (N/A — 400 used) / 429 / 500.

Delete a Site (destructive) — DELETE /api/scraper/site/{siteId}

Destructive operation. Deleting a Site cascades to delete every Link attached to it. The scraper itself is preserved — only the URL inventory is removed.

This endpoint is idempotent by design (ADR-013): deleting a site that’s already gone returns 204 No Content rather than 404. Agent retries are safe.

    DELETE /api/scraper/site/5f9a1b2c3d4e5f6a7b8c9d0e
    Authorization: Bearer <key>
    Idempotency-Key: <uuid>

Steps (single-call delete — Sites are an exception to the two-call destructive pattern because the cascade impact is bounded to the Link inventory):

  1. DELETE /api/scraper/site/{siteId} with Idempotency-Key.

The Idempotency-Key header is required on this endpoint (enforced by @RequireIdempotencyKey), even though ADR-013 already makes a repeated delete a no-op.

Response — 204 No Content. Verify deletion by re-fetching /api/scraper/{scraperId}/site, which should now return null.
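
A retrying client for this delete might look like the sketch below. It assumes a hypothetical `send(method, path, headers)` callable that adds the bearer token and returns the HTTP status code. Retrying on 429 and 500 and reusing one Idempotency-Key across attempts are choices of this sketch, not documented requirements; a real client would also wait between attempts.

```python
import uuid


def delete_site(send, site_id, max_attempts=3):
    """Delete a Site, retrying transient failures.

    Returns True once the server answers 204, False if every attempt
    hit a retryable status, and raises ValueError on any other status.
    """
    # One key for the whole logical operation, shared by all retries.
    headers = {"Idempotency-Key": str(uuid.uuid4())}
    path = f"/api/scraper/site/{site_id}"
    for _ in range(max_attempts):
        status = send("DELETE", path, headers)
        if status == 204:
            return True
        if status not in (429, 500):
            raise ValueError(f"DELETE {path} failed with HTTP {status}")
    return False
```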

Errors

| Code | Meaning |
| --- | --- |
| 400 | siteId is not a valid ObjectId |
| 401 | Missing/invalid bearer |
| 403 | N/A (cross-tenant returns 204 via ADR-013) |
| 404 | N/A (idempotent — 204 used instead) |
| 429 | Rate-limited |
| 500 | Persistence failure |

See also

  • Scrapers — site-backed scrapers iterate the Site’s link list at run time
  • Scraper jobs — per-link error details in load-history/group/{groupId}/job/{jobId}/errors
  • Plan features — which tiers unlock SITE_MAP