Sitemaps
A two-endpoint surface for bootstrapping a scraper’s URL inventory from the target site’s sitemap.xml. The harvest endpoint crawls <site>/sitemap.xml (and any referenced sub-sitemaps), persisting up to cap URLs as sitemap entries attached to the given scraper. The browse endpoint lets you read what was harvested, paginated and filtered.
Subject to the SITE_MAP plan feature.
Sitemap entries are wiped every night at 02:00 UTC — re-harvest daily if you need fresh state. They are intentionally ephemeral; use Sites instead for a persistent URL list backing a scraper.
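Because entries vanish at a fixed time each night, a client may want to check whether its last harvest predates the most recent wipe before it browses. The following is a minimal sketch of that check, assuming the wipe happens at exactly 02:00 UTC as stated above; the function names `last_wipe_before` and `needs_reharvest` are hypothetical helpers, not part of the API.

```python
from datetime import datetime, time, timedelta, timezone

# Nightly wipe time stated in this page (02:00 UTC).
WIPE_TIME_UTC = time(2, 0)


def last_wipe_before(now: datetime) -> datetime:
    """Return the most recent 02:00 UTC instant at or before `now`.

    `now` is expected to be timezone-aware.
    """
    now = now.astimezone(timezone.utc)
    candidate = datetime.combine(now.date(), WIPE_TIME_UTC, tzinfo=timezone.utc)
    if candidate > now:
        # Today's wipe has not happened yet, so use yesterday's.
        candidate -= timedelta(days=1)
    return candidate


def needs_reharvest(last_harvest: datetime, now: datetime) -> bool:
    """True if a nightly wipe has occurred since `last_harvest`."""
    return last_harvest.astimezone(timezone.utc) < last_wipe_before(now)
```

A scheduler could call `needs_reharvest(last_harvest, datetime.now(timezone.utc))` and trigger the harvest endpoint only when it returns `True`.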
Endpoint summary
| Method | Path | Operation ID | Auth scope |
|---|---|---|---|
| POST | /api/scraper/{scraperId}/sitemaps/harvest | scrapewise_run_scraper_sitemaps_harvest | bearer + SITE_MAP + idempotency-key |
| GET | /api/scraper/{scraperId}/sitemaps | scrapewise_get_scraper_sitemaps | bearer |
Harvest a sitemap — POST /api/scraper/{scraperId}/sitemaps/harvest
Triggers an async crawl of the target site’s sitemap.xml. The site URL is read from the scraper’s sourceConfig.url; this endpoint does NOT take a URL argument. Returns the count of URLs ingested.
curl -X POST \
-H "Authorization: Bearer $KEY" \
-H "Idempotency-Key: $(uuidgen)" \
"https://portal.scrapewise.ai/api/scraper-api/api/scraper/scr_abc123/sitemaps/harvest?cap=1000"

| Param | Default | Description |
|---|---|---|
| cap (query) | 500000 | Maximum number of URLs to ingest. Protects memory on very large sites. |
Response (200) — Long: the count of URLs ingested. 0 means the target site’s sitemap.xml was missing, unreachable, or empty.
Errors — 400 (scraper_not_found) / 401 / 402 (plan lacks SITE_MAP) / 403 (N/A) / 404 (N/A — 400 used) / 429 / 500 / 502 (sitemap.xml URL unreachable or malformed).
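For callers who prefer not to shell out to curl, the same request can be assembled with Python's standard library. This is a sketch under assumptions: the base URL is copied from the curl example, and `build_harvest_request` is a hypothetical helper name. It only builds the request object and does not send anything.

```python
import uuid
from typing import Optional
from urllib.parse import quote, urlencode
from urllib.request import Request

# Prefix copied from the curl example on this page.
BASE_URL = "https://portal.scrapewise.ai/api/scraper-api"


def build_harvest_request(
    scraper_id: str,
    api_key: str,
    cap: Optional[int] = None,
    idempotency_key: Optional[str] = None,
) -> Request:
    """Build (but do not send) the POST request for the harvest endpoint."""
    url = f"{BASE_URL}/api/scraper/{quote(scraper_id, safe='')}/sitemaps/harvest"
    if cap is not None:
        # Omitting cap leaves the server-side default in effect.
        url += "?" + urlencode({"cap": cap})
    headers = {
        "Authorization": f"Bearer {api_key}",
        # A fresh UUID is generated unless the caller supplies a key.
        "Idempotency-Key": idempotency_key or str(uuid.uuid4()),
    }
    return Request(url, headers=headers, method="POST")
```

Passing the result to `urllib.request.urlopen` would send it; the body of a 200 response is the ingested URL count described above. Supplying the same `idempotency_key` when retrying a failed call is the usual purpose of such a header, although the retry semantics are not spelled out on this page.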
Browse harvested entries — GET /api/scraper/{scraperId}/sitemaps
Returns a page of SlimSitemapEntryDTO records, slim by design (URL + image + image title + lastSeen, not full HTML). Use it to inspect what’s been harvested before configuring a Site, or to filter to a sub-tree.
curl -H "Authorization: Bearer $KEY" \
"https://portal.scrapewise.ai/api/scraper-api/api/scraper/scr_abc123/sitemaps?page=0&size=100&search=shoes&restriction=BASE_PATH"

Query params
| Param | Type | Description |
|---|---|---|
| page | int | Zero-indexed page number |
| size | int | Page size |
| sort | string | Spring-style sort, e.g. lastSeen,desc |
| filters | string | URL-encoded JSON Mongo-style filter on sitemap-entry fields |
| search | string | Keyword that matches anywhere in url / image / imageTitle |
| restriction | enum | NONE (default) / BASE_PATH / SLASH_COUNT — narrows by URL shape |
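The `filters` parameter is the easiest one to get wrong by hand, since it is JSON that must then be URL-encoded. The sketch below shows one way to assemble the query string from the parameters in the table; `build_browse_query` is a hypothetical helper, and which Mongo-style operators the server accepts is not stated here, so any operator passed in `filters` is an assumption on the caller's side.

```python
import json
from typing import Any, Mapping, Optional
from urllib.parse import urlencode

# Enum values listed in the query-parameter table.
RESTRICTIONS = {"NONE", "BASE_PATH", "SLASH_COUNT"}


def build_browse_query(
    page: int = 0,
    size: int = 100,
    sort: Optional[str] = None,
    filters: Optional[Mapping[str, Any]] = None,
    search: Optional[str] = None,
    restriction: str = "NONE",
) -> str:
    """Return the URL-encoded query string for GET .../sitemaps."""
    if restriction not in RESTRICTIONS:
        raise ValueError(f"unknown restriction: {restriction!r}")
    params = {"page": page, "size": size}
    if sort:
        params["sort"] = sort  # e.g. "lastSeen,desc"
    if filters:
        # Serialize the filter to compact JSON; urlencode escapes it.
        params["filters"] = json.dumps(filters, separators=(",", ":"))
    if search:
        params["search"] = search
    if restriction != "NONE":
        # NONE is the documented default, so it is left off the URL.
        params["restriction"] = restriction
    return urlencode(params)
```

Calling `build_browse_query(search="shoes", restriction="BASE_PATH")` reproduces the query string used in the curl example.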
The restriction modes narrow the result-set by URL geometry:
- NONE: no filter (returns everything matching search / filters)
- BASE_PATH: only entries sharing the scraper’s base path
- SLASH_COUNT: only entries with the same depth (count of /) as the scraper’s URL
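To make the two URL-shape modes concrete, here is a local illustration of what they could mean. It is only one plausible reading: the server's exact matching rules are not documented on this page, so both functions below (hypothetical names) encode assumptions, namely that BASE_PATH compares host plus path prefix and that SLASH_COUNT counts `/` characters in the URL path only.

```python
from urllib.parse import urlsplit


def _as_dir(path: str) -> str:
    """Normalize a path so that prefix checks respect segment boundaries."""
    return path if path.endswith("/") else path + "/"


def shares_base_path(entry_url: str, scraper_url: str) -> bool:
    """Assumed BASE_PATH reading: same host, entry path under scraper path."""
    entry, base = urlsplit(entry_url), urlsplit(scraper_url)
    return entry.netloc == base.netloc and _as_dir(entry.path).startswith(
        _as_dir(base.path)
    )


def same_slash_count(entry_url: str, scraper_url: str) -> bool:
    """Assumed SLASH_COUNT reading: equal number of '/' in the two paths."""
    return urlsplit(entry_url).path.count("/") == urlsplit(scraper_url).path.count("/")
```

Under this reading, a scraper URL of `https://shop.example/products/shoes` matches `https://shop.example/products/shoes/red-sneaker` in BASE_PATH mode and `https://shop.example/products/hats` in SLASH_COUNT mode. Trailing slashes change the slash count in this sketch, which may or may not mirror the server.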
Response (200) — Spring Page<SlimSitemapEntryDTO>:
{
"content": [
{ "url": "https://...", "image": "https://...", "imageTitle": "...", "lastSeen": "2026-05-19T01:23:45Z" }
],
"totalElements": 873,
"totalPages": 9,
"size": 100,
"number": 0
}

Empty page = no harvest has run yet, or the harvest produced 0 URLs.
Errors — 400 (N/A — malformed query → 400 envelope) / 401 / 403 (N/A) / 404 (N/A) / 429 / 500.
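Since the response is a Spring-style page object, reading everything that was harvested means walking `page=0, 1, 2, ...` until `totalPages` is reached. A minimal sketch of that loop follows, with the HTTP call abstracted behind a caller-supplied `fetch_page` function (a hypothetical hook, so the example stays independent of any HTTP client). It assumes the response fields shown in the JSON above.

```python
from typing import Any, Callable, Dict, Iterator

Page = Dict[str, Any]


def iter_entries(fetch_page: Callable[[int], Page]) -> Iterator[Dict[str, Any]]:
    """Yield every sitemap entry by requesting pages 0, 1, 2, ... in order.

    `fetch_page(n)` must return the decoded JSON body for page n.
    """
    page_number = 0
    while True:
        page = fetch_page(page_number)
        yield from page["content"]
        page_number += 1
        # Stop at the last page, or early if a page comes back empty
        # (for example, no harvest has run or entries were wiped mid-walk).
        if page_number >= page["totalPages"] or not page["content"]:
            break
```

In practice `fetch_page` would issue the GET request with the desired `size`, `search` and `restriction` values and decode the JSON response; keeping it injectable also makes the loop easy to test with canned pages.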
Typical flow
- Create a scraper pointing at the site root via PUT /api/scraper.
- Call this page’s /sitemaps/harvest to ingest the site’s URL inventory.
- Browse via /sitemaps with search= or restriction= to narrow to the URL shape you care about (e.g. product pages only).
- Promote the narrowed list into a persistent Site for the scraper to iterate over.
- Re-run harvest daily (cron / scheduled job) — entries are wiped at 02:00 UTC.
See also
- Sites: persistent URL inventory backing a scraper. Promote sitemap entries into a Site once you’ve narrowed them.
- Scrapers (Create or update): the scraper whose sourceConfig.url the harvest reads.
- Plan features: which tier unlocks SITE_MAP.