Sites
A Site is the URL inventory for a scraper that runs against a fixed set of pages (vs a paginated category listing). Use Sites when you want a scraper to iterate “these 50 specific URLs” rather than crawl a category page.
Mutating requests are subject to the SITE_MAP plan feature. Read endpoints work for any authenticated customer who owns the scraper.
Endpoint summary
| Method | Path | Operation ID | Auth scope |
|---|---|---|---|
| PUT | /api/scraper/site | scrapewise_create_scraper_site | bearer + SITE_MAP + idempotency-key |
| GET | /api/scraper/{id}/site | scrapewise_get_scraper_site | bearer |
| GET | /api/scraper/site/{siteId}/links | scrapewise_get_scraper_site_links | bearer |
| DELETE | /api/scraper/site/{siteId} | scrapewise_delete_scraper_site | bearer + idempotency-key |
Create or update a Site — PUT /api/scraper/site
PUT /api/scraper/site
Authorization: Bearer <key>
Idempotency-Key: <uuid>
Content-Type: application/json

{
  "id": null,
  "name": "competitor-x product pages",
  "scraperRef": "scr_abc123",
  "links": [
    { "url": "https://competitor-x.com/product/1", "title": "Product 1" },
    { "url": "https://competitor-x.com/product/2", "title": "Product 2" }
  ]
}

Empty id → create new; non-empty id → update. The initial Link list can be passed inline via links[].
Response (200) — persisted SiteDTO with the assigned id.
Errors — 400 (validation) / 401 / 402 (plan lacks SITE_MAP) / 403 (N/A) / 404 (N/A) / 429 / 500.
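As a rough illustration of the request above, the following Python sketch builds (but does not send) the PUT using only the standard library. The base URL is copied from the curl examples on this page. The helper name `build_put_site_request` and the placeholder key are hypothetical, and generating a UUID4 per logical operation is an assumption about how the Idempotency-Key is meant to be chosen.

```python
import json
import uuid
import urllib.request

# Base URL as it appears in the curl examples on this page.
BASE_URL = "https://portal.scrapewise.ai/api/scraper-api"


def build_put_site_request(api_key, site, idempotency_key=None):
    """Return an unsent urllib Request for PUT /api/scraper/site (hypothetical helper)."""
    # Assumption: one fresh UUID per logical create/update, reused on retries.
    key = idempotency_key or str(uuid.uuid4())
    return urllib.request.Request(
        url=f"{BASE_URL}/api/scraper/site",
        data=json.dumps(site).encode("utf-8"),
        method="PUT",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Idempotency-Key": key,
            "Content-Type": "application/json",
        },
    )


site = {
    "id": None,  # serialises to JSON null, i.e. "create new"
    "name": "competitor-x product pages",
    "scraperRef": "scr_abc123",
    "links": [
        {"url": "https://competitor-x.com/product/1", "title": "Product 1"},
    ],
}
req = build_put_site_request(
    "my-key", site, idempotency_key="11111111-1111-1111-1111-111111111111"
)
# Pass req to urllib.request.urlopen(...) to actually send it.
```

Building the request separately from sending it makes it easy to reuse the same Idempotency-Key if the call has to be retried.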
Adding more links to an existing Site: today there is no incremental add-link endpoint. Re-PUT the full SiteDTO with the union of old and new links.
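The union step can be sketched in Python as follows. This assumes you have already collected the existing links (for example from the paginated links endpoint) into the SiteDTO dict. Treating `url` as the identity of a link is an assumption of this sketch, not documented server behaviour, and both helper names are hypothetical.

```python
def merge_links(existing, new):
    """Union two link lists keyed by url; entries in `new` win on conflict."""
    merged = {link["url"]: link for link in existing}
    for link in new:
        merged[link["url"]] = link
    return list(merged.values())


def with_added_links(site, new_links):
    """Return a copy of a SiteDTO dict carrying the union of old and new links."""
    updated = dict(site)
    updated["links"] = merge_links(site.get("links", []), new_links)
    return updated


current = {
    "id": "5f9a1b2c3d4e5f6a7b8c9d0e",  # non-empty id, so the PUT is an update
    "name": "competitor-x product pages",
    "scraperRef": "scr_abc123",
    "links": [
        {"url": "https://competitor-x.com/product/1", "title": "Product 1"},
        {"url": "https://competitor-x.com/product/2", "title": "Product 2"},
    ],
}
updated = with_added_links(
    current,
    [
        {"url": "https://competitor-x.com/product/2", "title": "Product 2 (renamed)"},
        {"url": "https://competitor-x.com/product/3", "title": "Product 3"},
    ],
)
```

The resulting `updated` dict is what would be sent as the body of the next PUT; the original `current` dict is left unmodified.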
Get a scraper’s Site — GET /api/scraper/{id}/site
curl -H "Authorization: Bearer $KEY" \
  https://portal.scrapewise.ai/api/scraper-api/api/scraper/scr_abc123/site

Returns the Site attached to a given scraper. Note that the path id is the scraper’s id, NOT the site’s id: the lookup is “what site does this scraper use?”.
Response (200) — SiteDTO or null. Not every scraper uses a Site (single-page scrapers and paginated category-listing scrapers don’t). For the actual link inventory, follow up with /api/scraper/site/{siteId}/links using the site id from this response.
Errors — 400 (N/A) / 401 / 403 (N/A) / 404 (N/A) / 429 / 500.
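Because a 200 response may carry no Site, client code needs a null check before following up with the links endpoint. The sketch below shows one way to do that in Python. It assumes the SiteDTO JSON exposes its identifier as `id`, and it also treats an empty body as "no Site", which is a defensive guess rather than documented behaviour. The function name is hypothetical.

```python
import json


def links_path_from_site_response(body):
    """Map the body of GET /api/scraper/{id}/site to the follow-up links path.

    Returns None when the scraper has no Site attached.
    """
    if not body.strip():
        return None  # assumption: an empty body also means "no Site"
    site = json.loads(body)  # JSON null decodes to Python None
    if site is None:
        return None
    return f"/api/scraper/site/{site['id']}/links"


with_site = links_path_from_site_response(
    '{"id": "5f9a1b2c3d4e5f6a7b8c9d0e", "name": "competitor-x product pages"}'
)
without_site = links_path_from_site_response("null")
```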
List Site links (paginated, filterable) — GET /api/scraper/site/{siteId}/links
curl -H "Authorization: Bearer $KEY" \
  "https://portal.scrapewise.ai/api/scraper-api/api/scraper/site/5f9a.../links?page=0&size=100&sortField=url&sortDirection=asc"

Paginated URL inventory. Each Link carries url, title, and per-link state (visited, errored, pending).
Query params
| Param | Default | Description |
|---|---|---|
| page | 0 | Zero-indexed page number |
| size | 100 | Page size, capped at 1000 |
| sortField | url | One of url, title, curl (whitelisted) |
| sortDirection | asc | asc or desc (case-insensitive) |
| filters | — | URL-encoded MongoDB-style filter expression, e.g. {"state":"ERRORED"} |
Out-of-range page and size values are silently clamped rather than rejected with a 500. Unknown sortField values fall back to url, and the server logs a warning.
Filter fields are whitelisted: url, title, curl. Logical operators $and, $or, $nor are supported and recursively sanitised.
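To make the query parameters concrete, here is a small Python sketch that assembles the links URL, including a URL-encoded `$or` filter over a whitelisted field. The helper name `links_url` is hypothetical. The client-side clamping and sortField fallback only mirror what the page says the server does and are not required. The base URL is copied from the curl examples.

```python
import json
from urllib.parse import urlencode

BASE_URL = "https://portal.scrapewise.ai/api/scraper-api"
SORT_FIELDS = {"url", "title", "curl"}  # whitelist stated on this page


def links_url(site_id, page=0, size=100, sort_field="url",
              sort_direction="asc", filters=None):
    """Build the GET URL for /api/scraper/site/{siteId}/links (hypothetical helper)."""
    params = {
        "page": max(page, 0),      # mirror of the server-side clamp
        "size": min(size, 1000),   # documented cap
        "sortField": sort_field if sort_field in SORT_FIELDS else "url",
        "sortDirection": sort_direction.lower(),
    }
    if filters is not None:
        # urlencode percent-encodes the JSON text for us.
        params["filters"] = json.dumps(filters, separators=(",", ":"))
    return f"{BASE_URL}/api/scraper/site/{site_id}/links?{urlencode(params)}"


url = links_url(
    "5f9a1b2c3d4e5f6a7b8c9d0e",
    size=5000,
    sort_field="unknown",
    filters={"$or": [{"title": "Product 1"}, {"title": "Product 2"}]},
)
```

Serialising the filter with `json.dumps` and letting `urlencode` handle escaping avoids hand-building the percent-encoded string.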
Response (200) — Spring Page<LinkDTO> ({ content, totalElements, totalPages, ... }).
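Walking the full inventory means requesting pages until `totalPages` is reached. The generator below is a minimal sketch of that loop. `fetch_page` is a hypothetical caller-supplied function that performs the HTTP call and returns the decoded page JSON, and only the `content` and `totalPages` fields named above are relied on.

```python
def iter_links(fetch_page, size=100):
    """Yield every link dict, requesting pages until totalPages is reached."""
    page = 0
    while True:
        body = fetch_page(page, size)
        yield from body["content"]
        page += 1
        if page >= body["totalPages"]:
            break


# Fake two-page response for illustration only.
fake_pages = [
    {
        "content": [
            {"url": "https://competitor-x.com/product/1"},
            {"url": "https://competitor-x.com/product/2"},
        ],
        "totalPages": 2,
    },
    {
        "content": [{"url": "https://competitor-x.com/product/3"}],
        "totalPages": 2,
    },
]


def fake_fetch(page, size):
    return fake_pages[page]


urls = [link["url"] for link in iter_links(fake_fetch, size=2)]
```

An empty inventory (`totalPages` of 0 and an empty `content`) ends the loop after the first request.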
Errors — 400 (invalid siteId ObjectId / malformed filters JSON / site not found for customer) / 401 / 403 (N/A) / 404 (N/A — 400 used) / 429 / 500.
Delete a Site (destructive) — DELETE /api/scraper/site/{siteId}
Destructive operation. Deleting a Site cascades to delete every Link attached to it. The scraper itself is preserved — only the URL inventory is removed.
This endpoint is idempotent by design (ADR-013): deleting a site that’s already gone returns 204 No Content rather than 404. Agent retries are safe.
DELETE /api/scraper/site/5f9a1b2c3d4e5f6a7b8c9d0e
Authorization: Bearer <key>
Idempotency-Key: <uuid>

Steps (single-call delete; Sites are an exception to the two-call destructive pattern because the cascade impact is bounded to the Link inventory):
1. DELETE /api/scraper/site/{siteId} with Idempotency-Key.

The Idempotency-Key header is always required (enforced by @RequireIdempotencyKey). Reusing the same key across retries is optional: a retry sent with a fresh key may execute a second time, but that second call is a no-op thanks to ADR-013.
Response — 204 No Content. Verify deletion by re-fetching /api/scraper/{scraperId}/site, which should now return null.
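Since retries are safe, a client can wrap the delete in a simple retry loop. The sketch below is one possible shape. `send` is a hypothetical caller-supplied transport that attaches the bearer token and returns the HTTP status code. Retrying on 429 and 500 and the attempt limit are assumptions of this example, and backoff is omitted for brevity.

```python
import uuid


def delete_site(send, site_id, max_attempts=3):
    """Delete a Site, retrying transient failures with one shared Idempotency-Key."""
    headers = {"Idempotency-Key": str(uuid.uuid4())}  # same key for every attempt
    path = f"/api/scraper/site/{site_id}"
    for _ in range(max_attempts):
        try:
            status = send("DELETE", path, headers)
        except ConnectionError:
            continue  # network failure: retry, the delete is idempotent
        if status == 204:
            return True
        if status not in (429, 500):
            raise RuntimeError(f"unexpected status {status}")
    return False


# Fake transport for illustration: the first call drops, the second succeeds.
calls = []


def fake_send(method, path, headers):
    calls.append((method, path, headers["Idempotency-Key"]))
    if len(calls) == 1:
        raise ConnectionError("simulated timeout")
    return 204


deleted = delete_site(fake_send, "5f9a1b2c3d4e5f6a7b8c9d0e")
```

A 400 or 401 is raised immediately rather than retried, since repeating the same request would not change the outcome.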
Errors
| Code | Meaning |
|---|---|
| 400 | siteId is not a valid ObjectId |
| 401 | Missing/invalid bearer |
| 403 | N/A (cross-tenant returns 204 via ADR-013) |
| 404 | N/A (idempotent — 204 used instead) |
| 429 | Rate-limited |
| 500 | Persistence failure |
See also
- Scrapers: site-backed scrapers iterate the Site’s link list at run time
- Scraper jobs: per-link error details in load-history/group/{groupId}/job/{jobId}/errors
- Plan features: which tiers unlock SITE_MAP