Sitemaps

Two endpoints for bootstrapping a scraper’s URL inventory from the target site’s sitemap.xml. The harvest endpoint crawls `<site>/sitemap.xml` (and any referenced sub-sitemaps) and persists up to `cap` URLs as sitemap entries attached to the given scraper. The browse endpoint returns the harvested entries, paginated and optionally filtered.

Subject to the SITE_MAP plan feature.

Sitemap entries are wiped every night at 02:00 UTC — re-harvest daily if you need fresh state. They are intentionally ephemeral; use Sites instead for a persistent URL list backing a scraper.
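To make the wipe schedule concrete, the following sketch computes how long a harvest stays readable before the next 02:00 UTC wipe. The helper names `next_wipe` and `time_until_wipe` are illustrative and not part of the API; the only fact taken from this page is the 02:00 UTC wipe time.

```python
from datetime import datetime, time, timedelta, timezone

# 02:00 UTC is the nightly wipe time stated on this page.
WIPE_TIME_UTC = time(2, 0)


def next_wipe(now: datetime) -> datetime:
    """Return the first 02:00 UTC instant strictly after `now`.

    `now` must be timezone-aware; a naive datetime would be
    interpreted as local time by astimezone().
    """
    now_utc = now.astimezone(timezone.utc)
    candidate = datetime.combine(now_utc.date(), WIPE_TIME_UTC, tzinfo=timezone.utc)
    if candidate <= now_utc:
        # Today's wipe has already happened, so the next one is tomorrow.
        candidate += timedelta(days=1)
    return candidate


def time_until_wipe(harvested_at: datetime) -> timedelta:
    """How long entries harvested at `harvested_at` remain before the wipe."""
    return next_wipe(harvested_at) - harvested_at
```

A harvest that runs shortly before 02:00 UTC is wiped almost immediately, so a daily job is probably best scheduled shortly after that time.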

Endpoint summary

| Method | Path | Operation ID | Auth scope |
| --- | --- | --- | --- |
| POST | `/api/scraper/{scraperId}/sitemaps/harvest` | `scrapewise_run_scraper_sitemaps_harvest` | bearer + `SITE_MAP` + idempotency-key |
| GET | `/api/scraper/{scraperId}/sitemaps` | `scrapewise_get_scraper_sitemaps` | bearer |

Harvest a sitemap — `POST /api/scraper/{scraperId}/sitemaps/harvest`

Triggers an async crawl of the target site’s sitemap.xml. The site URL is read from the scraper’s sourceConfig.url; this endpoint does NOT take a URL argument. Returns the count of URLs ingested.


curl -X POST \
  -H "Authorization: Bearer $KEY" \
  -H "Idempotency-Key: $(uuidgen)" \
  "https://portal.scrapewise.ai/api/scraper-api/api/scraper/scr_abc123/sitemaps/harvest?cap=1000"

| Param | Default | Description |
| --- | --- | --- |
| `cap` (query) | `500000` | Maximum number of URLs to ingest. Protects memory on very large sites. |

Response (200) — Long: the count of URLs ingested. 0 means the target site’s sitemap.xml was missing, unreachable, or empty.
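The same call can be sketched in Python with only the standard library. The base URL is copied from the curl example; the function names are hypothetical, and reading the response body as a bare integer is an assumption based on the "Long" description rather than a documented wire format.

```python
import urllib.parse
import urllib.request
import uuid

# Base URL taken from the curl example on this page.
BASE_URL = "https://portal.scrapewise.ai/api/scraper-api"


def build_harvest_request(scraper_id, api_key, cap=None, idempotency_key=None):
    """Build (but do not send) the POST request for a sitemap harvest."""
    path = f"/api/scraper/{urllib.parse.quote(scraper_id, safe='')}/sitemaps/harvest"
    # Omit `cap` entirely to let the server apply its default.
    query = "?" + urllib.parse.urlencode({"cap": cap}) if cap is not None else ""
    return urllib.request.Request(
        BASE_URL + path + query,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            # A fresh UUID per logical harvest, mirroring $(uuidgen) in the curl example.
            "Idempotency-Key": idempotency_key or str(uuid.uuid4()),
        },
    )


def parse_harvest_count(body: bytes) -> int:
    """Assumption: the 200 body is a bare integer such as b'873'."""
    return int(body.decode("utf-8").strip())
```

Sending the request would then look like `urllib.request.urlopen(build_harvest_request("scr_abc123", key, cap=1000))`, with the body passed to `parse_harvest_count`. A result of 0 signals a missing, unreachable, or empty sitemap.xml rather than an error.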

Errors: 400 (`scraper_not_found`; returned in place of 404 for an unknown scraper), 401, 402 (plan lacks `SITE_MAP`), 429, 500, 502 (sitemap.xml URL unreachable or malformed). 403 and 404 are not used by this endpoint.
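A client might translate these status codes into messages and a retry decision along the following lines. The status meanings come from the list above; which codes are worth retrying is an assumption of this sketch, not documented behaviour, and `describe_harvest_error` is a hypothetical helper.

```python
# Meanings taken from the error list on this page.
HARVEST_ERRORS = {
    400: "scraper_not_found: check the scraperId",
    401: "unauthorized: check the bearer token",
    402: "plan lacks the SITE_MAP feature",
    429: "rate limited",
    500: "server error",
    502: "target sitemap.xml unreachable or malformed",
}

# Assumption: these may be transient, so a later retry could succeed.
RETRYABLE = frozenset({429, 500, 502})


def describe_harvest_error(status: int) -> tuple[str, bool]:
    """Return (human-readable message, whether a retry seems reasonable)."""
    message = HARVEST_ERRORS.get(status, f"unexpected status {status}")
    return message, status in RETRYABLE
```

When retrying, reusing the original Idempotency-Key is the usual way to keep the retry from counting as a second harvest, assuming the service follows common idempotency-key semantics.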

Browse harvested entries — `GET /api/scraper/{scraperId}/sitemaps`

Returns a page of SlimSitemapEntryDTO records. These are slim by design (URL, image, image title, and lastSeen, not full HTML). Use this endpoint to inspect what has been harvested before configuring a Site, or to filter down to a sub-tree.


curl -H "Authorization: Bearer $KEY" \
  "https://portal.scrapewise.ai/api/scraper-api/api/scraper/scr_abc123/sitemaps?page=0&size=100&search=shoes&restriction=BASE_PATH"

Query params

| Param | Type | Description |
| --- | --- | --- |
| `page` | int | Zero-indexed page number |
| `size` | int | Page size |
| `sort` | string | Spring-style sort, e.g. `lastSeen,desc` |
| `filters` | string | URL-encoded JSON Mongo-style filter on sitemap-entry fields |
| `search` | string | Keyword that matches anywhere in `url` / `image` / `imageTitle` |
| `restriction` | enum | `NONE` (default) / `BASE_PATH` / `SLASH_COUNT`; narrows by URL shape |
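Because `filters` is JSON that must itself be URL-encoded, building the query string by hand is error-prone. The sketch below assembles it with the standard library. `build_browse_query` is a hypothetical helper, and the example filter (a `$regex` on `url`) only illustrates the Mongo-style shape; which operators and fields the service accepts is not specified here.

```python
import json
from urllib.parse import urlencode


def build_browse_query(page=0, size=100, sort=None, filters=None,
                       search=None, restriction=None):
    """Return an encoded query string for GET .../sitemaps."""
    params = {"page": page, "size": size}
    if sort:
        params["sort"] = sort  # e.g. "lastSeen,desc"
    if filters is not None:
        # Serialize the filter dict to compact JSON; urlencode() escapes it.
        params["filters"] = json.dumps(filters, separators=(",", ":"))
    if search:
        params["search"] = search
    if restriction:
        params["restriction"] = restriction
    return urlencode(params)
```

For instance, `build_browse_query(search="shoes", restriction="BASE_PATH")` reproduces the query string used in the curl example.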

The restriction modes narrow the result-set by URL geometry:

NONE — no filter (returns everything matching search / filters)
BASE_PATH — only entries sharing the scraper’s base path
SLASH_COUNT — only entries with the same depth (count of /) as the scraper’s URL
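As a rough mental model of the three modes, the following client-side sketch applies the same idea to plain URL strings. It is an interpretation of the descriptions above, not the server's implementation: treating BASE_PATH as "same host, path under the scraper's path" and SLASH_COUNT as "same number of `/` characters in the full URL" are both assumptions, and `matches_restriction` is a hypothetical name.

```python
from urllib.parse import urlsplit


def matches_restriction(entry_url: str, scraper_url: str, restriction: str = "NONE") -> bool:
    """Approximate the restriction modes locally (assumed semantics)."""
    if restriction == "NONE":
        return True
    if restriction == "BASE_PATH":
        entry, base = urlsplit(entry_url), urlsplit(scraper_url)
        # Normalize trailing slashes so "/shop/shoes" does not match "/shop/shoes-sale".
        base_path = base.path.rstrip("/") + "/"
        entry_path = entry.path.rstrip("/") + "/"
        return entry.netloc == base.netloc and entry_path.startswith(base_path)
    if restriction == "SLASH_COUNT":
        # Same depth: equal number of "/" characters in the whole URL.
        return entry_url.count("/") == scraper_url.count("/")
    raise ValueError(f"unknown restriction: {restriction}")
```

With a scraper URL of `https://example.com/shop/shoes`, BASE_PATH would keep `https://example.com/shop/shoes/red-sneaker`, while SLASH_COUNT would keep siblings such as `https://example.com/shop/hats`.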

Response (200) — Spring Page<SlimSitemapEntryDTO>:


{
  "content": [
    { "url": "https://...", "image": "https://...", "imageTitle": "...", "lastSeen": "2026-05-19T01:23:45Z" }
  ],
  "totalElements": 873,
  "totalPages": 9,
  "size": 100,
  "number": 0
}

An empty page means no harvest has run yet, the harvest produced 0 URLs, or the entries have since been removed by the nightly wipe.
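Walking every page of a response shaped like the one above can be sketched as a small generator. `fetch_page` is a hypothetical callable supplied by the caller: it takes a zero-indexed page number and returns the decoded JSON body. Only the `content` and `totalPages` fields shown in the sample response are relied on.

```python
from typing import Callable, Iterator


def iter_entries(fetch_page: Callable[[int], dict]) -> Iterator[dict]:
    """Yield every sitemap entry across all pages, in page order."""
    page = 0
    while True:
        body = fetch_page(page)
        yield from body.get("content", [])
        page += 1
        # Stop once the page index reaches totalPages (0 for an empty result).
        if page >= body.get("totalPages", 0):
            break
```

Keeping the HTTP call behind `fetch_page` leaves authentication, query parameters, and retries to the caller, and makes the loop easy to exercise against canned responses.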

Errors: 400 (no endpoint-specific code; a malformed query returns the 400 envelope), 401, 429, 500. 403 and 404 are not used by this endpoint.

Typical flow

1. Create a scraper pointing at the site root via `PUT /api/scraper`.
2. Call `POST /api/scraper/{scraperId}/sitemaps/harvest` to ingest the site’s URL inventory.
3. Browse via `GET /api/scraper/{scraperId}/sitemaps` with `search=` or `restriction=` to narrow to the URL shape you care about (e.g. product pages only).
4. Promote the narrowed list into a persistent Site for the scraper to iterate over.
5. Re-run the harvest daily (cron or scheduled job), since entries are wiped at 02:00 UTC.
