Sitemaps

A two-endpoint surface for bootstrapping a scraper’s URL inventory from the target site’s sitemap.xml. The harvest endpoint crawls <site>/sitemap.xml (and any referenced sub-sitemaps), persisting up to cap URLs as sitemap entries attached to the given scraper. The browse endpoint lets you read what was harvested, paginated and filtered.

Subject to the SITE_MAP plan feature.

Sitemap entries are wiped every night at 02:00 UTC — re-harvest daily if you need fresh state. They are intentionally ephemeral; use Sites instead for a persistent URL list backing a scraper.

Endpoint summary

| Method | Path | Operation ID | Auth scope |
| --- | --- | --- | --- |
| POST | /api/scraper/{scraperId}/sitemaps/harvest | scrapewise_run_scraper_sitemaps_harvest | bearer + SITE_MAP + idempotency-key |
| GET | /api/scraper/{scraperId}/sitemaps | scrapewise_get_scraper_sitemaps | bearer |

Harvest a sitemap — POST /api/scraper/{scraperId}/sitemaps/harvest

Triggers an async crawl of the target site’s sitemap.xml. The site URL is read from the scraper’s sourceConfig.url; this endpoint does NOT take a URL argument. Returns the count of URLs ingested.

curl -X POST \
  -H "Authorization: Bearer $KEY" \
  -H "Idempotency-Key: $(uuidgen)" \
  "https://portal.scrapewise.ai/api/scraper-api/api/scraper/scr_abc123/sitemaps/harvest?cap=1000"

| Param | Default | Description |
| --- | --- | --- |
| cap (query) | 500000 | Maximum number of URLs to ingest. Protects memory on very large sites. |

Response (200): Long, the count of URLs ingested. 0 means the target site’s sitemap.xml was missing, unreachable, or empty.

Errors: 400 (scraper_not_found; returned in place of 404) / 401 / 402 (plan lacks SITE_MAP) / 429 / 500 / 502 (sitemap.xml URL unreachable or malformed). 403 does not apply to this endpoint.
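The harvest call above can be sketched in Python using only the standard library. This is a minimal sketch, not an official client: the function names are hypothetical, BASE_URL is copied from the curl example, and the code assumes the 200 response body is the bare integer count (the page only says the response is a Long). Reusing the same Idempotency-Key when retrying one logical harvest is common idempotency-key practice and is assumed here rather than confirmed by this page.

```python
import urllib.request
import uuid
from urllib.parse import quote, urlencode

# Base URL copied from the curl example on this page.
BASE_URL = "https://portal.scrapewise.ai/api/scraper-api"


def build_harvest_request(scraper_id, api_key, cap=None, idempotency_key=None):
    """Build (but do not send) the POST request for the harvest endpoint."""
    url = f"{BASE_URL}/api/scraper/{quote(scraper_id, safe='')}/sitemaps/harvest"
    if cap is not None:
        # Omitting cap leaves the server default in effect.
        url += "?" + urlencode({"cap": cap})
    headers = {
        "Authorization": f"Bearer {api_key}",
        # Assumption: one key per logical harvest, reused on retries.
        "Idempotency-Key": idempotency_key or str(uuid.uuid4()),
    }
    return urllib.request.Request(url, data=b"", headers=headers, method="POST")


def harvest(scraper_id, api_key, cap=None):
    """Send the request and return the ingested-URL count.

    Assumption: the response body is a bare integer such as `873`.
    Non-2xx statuses surface as urllib.error.HTTPError.
    """
    request = build_harvest_request(scraper_id, api_key, cap)
    with urllib.request.urlopen(request, timeout=60) as response:
        return int(response.read().decode("utf-8").strip())
```

A caller would typically treat a return value of 0 as "nothing to browse" and catch urllib.error.HTTPError to inspect the status codes listed above.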

Browse harvested entries — GET /api/scraper/{scraperId}/sitemaps

Returns one page of SlimSitemapEntryDTO records. The records are slim by design (URL, image, image title, and lastSeen, not full HTML). Use this endpoint to inspect what has been harvested before configuring a Site, or to filter to a sub-tree.

curl -H "Authorization: Bearer $KEY" \
  "https://portal.scrapewise.ai/api/scraper-api/api/scraper/scr_abc123/sitemaps?page=0&size=100&search=shoes&restriction=BASE_PATH"

Query params

| Param | Type | Description |
| --- | --- | --- |
| page | int | Zero-indexed page number |
| size | int | Page size |
| sort | string | Spring-style sort, e.g. lastSeen,desc |
| filters | string | URL-encoded JSON Mongo-style filter on sitemap-entry fields |
| search | string | Keyword that matches anywhere in url / image / imageTitle |
| restriction | enum | NONE (default), BASE_PATH, or SLASH_COUNT; narrows by URL shape |
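Because filters must be JSON that is then URL-encoded, building the query string by hand is error-prone. The sketch below shows one way to assemble a browse URL from the parameters in the table. The helper name is hypothetical, BASE_URL is copied from the curl example, and the filter shown in the usage note (a `$regex` on `url`) is only an illustration: this page does not list which Mongo-style operators are supported.

```python
import json
from urllib.parse import urlencode

# Base URL copied from the curl example on this page.
BASE_URL = "https://portal.scrapewise.ai/api/scraper-api"
RESTRICTIONS = ("NONE", "BASE_PATH", "SLASH_COUNT")


def build_browse_url(scraper_id, page=0, size=100, sort=None,
                     filters=None, search=None, restriction=None):
    """Return the GET URL for the browse endpoint with encoded query params."""
    params = {"page": page, "size": size}
    if sort:
        params["sort"] = sort  # e.g. "lastSeen,desc"
    if filters is not None:
        # Serialize the filter dict to compact JSON; urlencode escapes it.
        params["filters"] = json.dumps(filters, separators=(",", ":"))
    if search:
        params["search"] = search
    if restriction:
        if restriction not in RESTRICTIONS:
            raise ValueError(f"unknown restriction: {restriction!r}")
        params["restriction"] = restriction
    return f"{BASE_URL}/api/scraper/{scraper_id}/sitemaps?" + urlencode(params)
```

For example, `build_browse_url("scr_abc123", search="shoes", restriction="BASE_PATH")` reproduces the URL in the curl example, and passing `filters={"url": {"$regex": "/products/"}}` (a hypothetical filter) yields a correctly escaped `filters=` parameter.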

The restriction modes narrow the result-set by URL geometry:

  • NONE — no filter (returns everything matching search / filters)
  • BASE_PATH — only entries sharing the scraper’s base path
  • SLASH_COUNT — only entries with the same depth (count of /) as the scraper’s URL
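To make the two geometric modes concrete, here is a local illustration of what they might compute. It is a sketch of one plausible reading, not the server's implementation: it assumes BASE_PATH means "same host and a path under the scraper URL's path" and that SLASH_COUNT counts the `/` characters in the path component only. The function names and the shop.example URLs are hypothetical.

```python
from urllib.parse import urlsplit


def same_base_path(entry_url, scraper_url):
    """Assumed BASE_PATH reading: same host, path under the scraper's path."""
    entry, scraper = urlsplit(entry_url), urlsplit(scraper_url)
    base = scraper.path.rstrip("/")
    return entry.netloc == scraper.netloc and (
        entry.path == base or entry.path.startswith(base + "/")
    )


def slash_count(url):
    """Count '/' in the path, ignoring a trailing slash (assumed definition)."""
    return urlsplit(url).path.rstrip("/").count("/")


def same_slash_count(entry_url, scraper_url):
    """Assumed SLASH_COUNT reading: both URLs sit at the same path depth."""
    return slash_count(entry_url) == slash_count(scraper_url)


def apply_restriction(urls, scraper_url, mode="NONE"):
    """Filter a list of URLs locally according to the assumed semantics."""
    if mode == "BASE_PATH":
        return [u for u in urls if same_base_path(u, scraper_url)]
    if mode == "SLASH_COUNT":
        return [u for u in urls if same_slash_count(u, scraper_url)]
    return list(urls)  # NONE: no geometric narrowing
```

Under these assumptions, a scraper at `https://shop.example/products/shoes` would keep `https://shop.example/products/boots` in SLASH_COUNT mode and drop `https://shop.example/products/shoes/42`, which is one level deeper.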

Response (200) — Spring Page<SlimSitemapEntryDTO>:

{
  "content": [
    {
      "url": "https://...",
      "image": "https://...",
      "imageTitle": "...",
      "lastSeen": "2026-05-19T01:23:45Z"
    }
  ],
  "totalElements": 873,
  "totalPages": 9,
  "size": 100,
  "number": 0
}

An empty page means that no harvest has run yet, that the harvest produced 0 URLs, or that the nightly 02:00 UTC wipe has cleared the entries since the last harvest.
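Reading the full inventory means walking pages until `totalPages` is exhausted. The sketch below does that with an injected fetch function, so the paging logic can be exercised without network access. Both function names are hypothetical, BASE_URL is copied from the curl example, and the code relies only on the `content` and `totalPages` fields shown in the response above.

```python
import json
import urllib.request
from urllib.parse import urlencode

# Base URL copied from the curl example on this page.
BASE_URL = "https://portal.scrapewise.ai/api/scraper-api"


def make_fetcher(scraper_id, api_key, **extra_params):
    """Return fetch_page(page, size), which GETs one page as a dict."""
    def fetch_page(page, size):
        query = urlencode({"page": page, "size": size, **extra_params})
        url = f"{BASE_URL}/api/scraper/{scraper_id}/sitemaps?{query}"
        request = urllib.request.Request(
            url, headers={"Authorization": f"Bearer {api_key}"}
        )
        with urllib.request.urlopen(request, timeout=30) as response:
            return json.load(response)
    return fetch_page


def iter_entries(fetch_page, size=100):
    """Yield every entry, requesting pages 0, 1, ... up to totalPages - 1."""
    page = 0
    while True:
        body = fetch_page(page, size)
        yield from body.get("content", [])
        # An empty result reports totalPages == 0, so the loop stops at once.
        if page + 1 >= body.get("totalPages", 0):
            break
        page += 1
```

Usage would look like `for entry in iter_entries(make_fetcher("scr_abc123", key, search="shoes")): ...`, where each `entry` is a dict with the `url`, `image`, `imageTitle`, and `lastSeen` fields.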

Errors: 400 (malformed query, returned as a 400 envelope) / 401 / 429 / 500. 403 and 404 do not apply to this endpoint.

Typical flow

  1. Create a scraper pointing at the site root via PUT /api/scraper.
  2. Call POST /api/scraper/{scraperId}/sitemaps/harvest to ingest the site’s URL inventory.
  3. Browse via /sitemaps with search= or restriction= to narrow to the URL shape you care about (e.g. product pages only).
  4. Promote the narrowed list into a persistent Site for the scraper to iterate over.
  5. Re-run harvest daily (cron / scheduled job) — entries are wiped at 02:00 UTC.
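For step 5, a scheduler needs to know when the next harvest should run relative to the nightly wipe. The sketch below computes the next run time as a fixed delay after 02:00 UTC. The function name and the 15-minute safety delay are assumptions made for illustration; this page does not say how long the wipe takes.

```python
from datetime import datetime, time, timedelta, timezone

# Wipe time stated on this page.
WIPE_TIME_UTC = time(2, 0)


def next_harvest_after_wipe(now, delay=timedelta(minutes=15)):
    """Return the next UTC datetime that is `delay` after the 02:00 UTC wipe.

    `now` must be timezone-aware. The default delay is an arbitrary
    safety margin, not a documented value.
    """
    now = now.astimezone(timezone.utc)
    wipe_today = datetime.combine(now.date(), WIPE_TIME_UTC, tzinfo=timezone.utc)
    target = wipe_today + delay
    if target <= now:
        # Today's slot has already passed, so schedule tomorrow's.
        target += timedelta(days=1)
    return target
```

A cron entry firing shortly after 02:00 UTC achieves the same result; the function is useful when the harvest is triggered from a long-running worker that sleeps until the returned time.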

See also

  • Sites — persistent URL inventory backing a scraper. Promote sitemap entries into a Site once you’ve narrowed them.
  • Scrapers — Create or update — the scraper whose sourceConfig.url the harvest reads.
  • Plan features — which tier unlocks SITE_MAP.