Preview — auto-detect scraper configs

These two convenience endpoints back the scraper-builder UI’s “Simple Mode”. Hand over a target URL (or a curl invocation) and the service runs every applicable detection strategy in parallel (SINGLE_PRODUCT, MULTIPLE_PRODUCTS, APPLICATION_LD_JSON, plus the Amazon / Google specialisations), then returns the candidate scraper configs that produced data, along with their sample rows.

Each candidate is plan-feature gated; only configs your plan can actually run are returned. The preview itself persists nothing. Pick a candidate, then save it via PUT /api/scraper.

Endpoint summary

Method	Path	Operation ID	Auth scope
POST	`/api/scraper-simple/url-based`	`scrapewise_preview_scraper_from_url`	bearer
POST	`/api/scraper-simple/curl-based`	`scrapewise_preview_scraper_from_curl`	bearer + `API` feature

Preview from URL — `POST /api/scraper-simple/url-based`


POST /api/scraper-simple/url-based?tryWithHiddenData=false&useCache=false
Authorization: Bearer <key>
Content-Type: application/json
 
{
  "id": null,
  "name": "competitor-prices",
  "groupId": "grp_xyz",
  "url": "https://competitor.com/products",
  "itemsConfig": null
}

Request body (SimpleModeWithUrlDTO)

Field	Required	Type	Description
`id`	no	string	Existing scraper id when previewing changes to a saved scraper
`name`	yes	string	Display name for the future scraper (server uses it as a tag on candidate DTOs)
`groupId`	yes	string	Owning group’s MongoDB ObjectId (see Groups); `grp_xyz` in the examples is a placeholder
`url`	yes	string	The page to analyse. Public HTTP(S); internal / link-local hosts are SSRF-blocked
`itemsConfig`	no	array	Pre-existing field selectors to bias detection toward
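
Because internal and link-local hosts are rejected, it can save a round trip to screen the `url` field before sending it. The following is a minimal sketch of such a client-side pre-check. The helper name `isLikelyAllowedTarget` and the exact ranges it rejects are assumptions for illustration; the server's actual SSRF block list is not documented here and remains authoritative.

```javascript
// Hypothetical client-side pre-check; the server's real block list may differ.
function isLikelyAllowedTarget(raw) {
  let u;
  try {
    u = new URL(raw);
  } catch {
    return false; // not a parseable absolute URL
  }
  if (u.protocol !== 'http:' && u.protocol !== 'https:') return false;

  const host = u.hostname.toLowerCase();
  if (host === 'localhost' || host.endsWith('.localhost')) return false;
  if (host === '[::1]') return false; // IPv6 loopback

  // Dotted-quad IPv4 literals: reject loopback, private, and link-local ranges.
  const m = host.match(/^(\d+)\.(\d+)\.(\d+)\.(\d+)$/);
  if (m) {
    const a = Number(m[1]);
    const b = Number(m[2]);
    if (a === 10 || a === 127) return false;
    if (a === 169 && b === 254) return false;
    if (a === 172 && b >= 16 && b <= 31) return false;
    if (a === 192 && b === 168) return false;
  }
  return true;
}
```

A `false` result only means the request would probably be rejected; a `true` result is no guarantee that the server accepts the URL.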

Query params

Param	Default	Description
`tryWithHiddenData`	`false`	Enable a more aggressive JSON-LD pass that reads scripts hidden by the renderer
`useCache`	`false`	Reuse a recent fetch instead of hitting the network. Faster but may serve stale HTML
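
Setting both flags explicitly on the URL makes the request self-documenting. A small sketch, assuming the same base URL as the Node.js example on this page; `buildPreviewUrl` and `PREVIEW_BASE` are hypothetical names, not part of any SDK:

```javascript
// Base URL as used in this page's Node.js example.
const PREVIEW_BASE = 'https://portal.scrapewise.ai/api/scraper-api';

// Hypothetical helper: both flags default to false, matching the documented defaults.
function buildPreviewUrl({ tryWithHiddenData = false, useCache = false } = {}) {
  const url = new URL(`${PREVIEW_BASE}/api/scraper-simple/url-based`);
  url.searchParams.set('tryWithHiddenData', String(tryWithHiddenData));
  url.searchParams.set('useCache', String(useCache));
  return url.toString();
}
```

For instance, `buildPreviewUrl({ tryWithHiddenData: true })` yields the endpoint URL with `tryWithHiddenData=true&useCache=false` appended.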

Response (200) — Set<ScraperSampleDataDTO>. One entry per detection strategy that produced data, plan-filtered to what your plan can run:


[
  {
    "scraperDTO": {
      "name": "competitor-prices",
      "groupRef": "grp_xyz",
      "type": "MULTIPLE_PRODUCTS",
      "sourceConfig": {
        "url": "https://competitor.com/products",
        "pagination": { "type": "URL_PAGE_PARAM", "param": "page", "max": 50 }
      },
      "itemsConfig": [
        { "name": "title",    "selector": "h2.product-title",   "kind": "TEXT" },
        { "name": "price",    "selector": ".price",             "kind": "TEXT" },
        { "name": "imageUrl", "selector": "img.product-image",  "kind": "ATTR", "attr": "src" }
      ]
    },
    "sampleData": [
      { "title": "Premium Widget XL", "price": "29.99", "imageUrl": "https://..." },
      { "title": "Standard Widget",   "price": "19.99", "imageUrl": "https://..." }
    ],
    "executionTimeSec": 2.4
  }
]

An empty array is a normal success state: no detection strategy succeeded for this URL on this plan. Treat it as “this URL is unsupported by the available detectors”.
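
Client code therefore needs to handle both the empty case and the case where several strategies return candidates. One possible selection policy is sketched below. The helper `pickCandidate` and its fallback rule (most sample rows wins) are assumptions for illustration, not behaviour defined by the service.

```javascript
// Hypothetical helper: returns null for an empty result, otherwise the first
// candidate whose scraperDTO.type matches the caller's preference order,
// falling back to the candidate with the most sample rows.
function pickCandidate(candidates, preferredTypes = []) {
  if (!Array.isArray(candidates) || candidates.length === 0) return null;

  for (const type of preferredTypes) {
    const match = candidates.find((c) => c.scraperDTO && c.scraperDTO.type === type);
    if (match) return match;
  }

  const rows = (c) => (Array.isArray(c.sampleData) ? c.sampleData.length : 0);
  return candidates.reduce((best, c) => (rows(c) > rows(best) ? c : best));
}
```

A `null` return signals that no detector produced data, so there is nothing to save.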

Errors: 400 (validation failed) / 401 / 429 / 500 (a correlationId is returned if a detector or downstream proxy crashes). 403 and 404 do not apply to this endpoint.
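
The status codes above can be mapped to caller actions along these lines. This is a sketch under stated assumptions: `classifyPreviewError` is a hypothetical helper, the retry policy (only 429 is retried) is a suggestion rather than documented guidance, and reading `correlationId` from the top level of the JSON error body is a guess at the response shape.

```javascript
// Hypothetical mapping from HTTP status to a caller action.
// ASSUMPTION: correlationId sits at the top level of the error body.
function classifyPreviewError(status, body) {
  const correlationId = body && body.correlationId ? body.correlationId : null;
  switch (status) {
    case 400:
      return { retry: false, reason: 'validation failed; fix the request', correlationId };
    case 401:
      return { retry: false, reason: 'unauthorized; check the bearer key', correlationId };
    case 402:
      // Documented for the curl-based endpoint only.
      return { retry: false, reason: 'plan lacks the API feature', correlationId };
    case 429:
      return { retry: true, reason: 'too many requests; back off, then retry', correlationId };
    case 500:
      return { retry: false, reason: 'server error; report the correlationId', correlationId };
    default:
      return { retry: false, reason: `unexpected status ${status}`, correlationId };
  }
}
```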

Node.js


const res = await fetch(
  'https://portal.scrapewise.ai/api/scraper-api/api/scraper-simple/url-based',
  {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.SCRAPEWISE_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      name: 'competitor-prices',
      groupId: 'grp_xyz',
      url: 'https://competitor.com/products',
    }),
  }
);
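if (!res.ok) throw new Error(`Preview failed with HTTP ${res.status}`);
const candidates = await res.json();
// An empty array is a valid response: no detector produced data for this URL.
const pick = candidates[0];          // or choose by `scraperDTO.type`; undefined when the array is empty
// → then persist via PUT /api/scraper with `pick.scraperDTO`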

Preview from curl — `POST /api/scraper-simple/curl-based`

Same idea, but the input is a full curl invocation. Use it when the target endpoint needs a custom request shape (cookies, headers, an auth bearer, or a POST body) that wouldn’t survive a plain GET. Scrapewise parses the curl, replays the exact request, then runs detection on the response.

Requires the API plan feature.


POST /api/scraper-simple/curl-based
Authorization: Bearer <key>
Content-Type: application/json
 
{
  "name": "vendor-graphql-products",
  "groupId": "grp_xyz",
  "curl": "curl 'https://vendor.example/graphql' -H 'Cookie: session=...' -H 'Content-Type: application/json' --data-raw '{\"query\":\"{ products { id title price } }\"}'"
}

Request body (SimpleModeWithCurlDTO)

Field	Required	Type	Description
`id`	no	string	Existing scraper id when previewing changes
`name`	yes	string	Display name
`groupId`	yes	string	Owning group’s MongoDB ObjectId
`curl`	yes	string	Full curl command (single-line or with line continuations). Supports `-X` / `-H` / `-d` / `--cookie` / etc.
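
Hand-escaping the quotes of a curl command inside JSON, as in the request example above, is error-prone. Letting `JSON.stringify` do the escaping is usually safer. A minimal sketch follows; `buildCurlPreviewBody` is a hypothetical helper, and its "must start with curl" check is a client-side sanity check assumed for illustration, not a documented server rule.

```javascript
// Hypothetical helper: builds the JSON body for /api/scraper-simple/curl-based.
function buildCurlPreviewBody({ name, groupId, curl, id }) {
  for (const [key, value] of Object.entries({ name, groupId, curl })) {
    if (typeof value !== 'string' || value.trim() === '') {
      throw new Error(`${key} is required`);
    }
  }
  // ASSUMPTION: a cheap local check; the server does the real parsing.
  if (!/^\s*curl\s/.test(curl)) {
    throw new Error('curl must be a full curl command');
  }
  const payload = { name, groupId, curl };
  if (id) payload.id = id; // only when previewing changes to a saved scraper
  // JSON.stringify escapes embedded quotes, backslashes, and newlines.
  return JSON.stringify(payload);
}
```

The command can then be pasted verbatim, line continuations included, and the serialized body round-trips to the same string.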

Response (200) — same Set<ScraperSampleDataDTO> shape as /url-based.

Errors: 400 (curl string malformed) / 401 / 402 (plan lacks API feature) / 429 / 500. 403 and 404 do not apply to this endpoint.

What to do with a candidate

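The response is a JSON array. Each entry’s scraperDTO is a ready-to-persist payload. To save the picked candidate as a real scraper, send that scraperDTO (not the whole entry with its sampleData) as the PUT body; here candidate-from-preview.json is assumed to hold it: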


curl -X PUT \
  -H "Authorization: Bearer $KEY" \
  -H "Idempotency-Key: $(uuidgen)" \
  -H "Content-Type: application/json" \
  -d "$(cat candidate-from-preview.json)" \
  "https://portal.scrapewise.ai/api/scraper-api/api/scraper"

That returns the persisted scraper with its id. Then trigger runs via GET /api/scraper/{id}/run.
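
The same save-then-run flow can be sketched in Node.js. The helpers `buildSaveRequest` and `buildRunUrl` are hypothetical names; the base URL and headers mirror the curl command above, and the assumption that the saved scraper's JSON exposes a top-level `id` follows from the sentence above but is not otherwise confirmed.

```javascript
// Base URL as used in the curl example above.
const API_BASE = 'https://portal.scrapewise.ai/api/scraper-api';

// Hypothetical helper: builds the PUT request that persists a picked candidate.
// Only scraperDTO is sent; sampleData and executionTimeSec stay behind.
function buildSaveRequest(candidate, apiKey, idempotencyKey) {
  if (!candidate || !candidate.scraperDTO) {
    throw new Error('candidate has no scraperDTO');
  }
  return {
    url: `${API_BASE}/api/scraper`,
    init: {
      method: 'PUT',
      headers: {
        Authorization: `Bearer ${apiKey}`,
        'Idempotency-Key': idempotencyKey,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify(candidate.scraperDTO),
    },
  };
}

// Hypothetical helper: URL that triggers a run for a saved scraper id.
function buildRunUrl(id) {
  return `${API_BASE}/api/scraper/${encodeURIComponent(id)}/run`;
}

// Usage (inside an async function; randomUUID comes from node:crypto):
//   const { url, init } = buildSaveRequest(pick, process.env.SCRAPEWISE_KEY, randomUUID());
//   const saved = await (await fetch(url, init)).json();
//   await fetch(buildRunUrl(saved.id), { headers: { Authorization: init.headers.Authorization } });
```

Generating a fresh Idempotency-Key per save attempt mirrors the `uuidgen` call in the curl version.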

When to use Preview vs a real scraper

Question	Preview	Real scraper
Just exploring what a page yields?	✅	—
Need to scrape the same page repeatedly?	—	✅
Need pagination / scheduled runs / historical storage?	—	✅
Want to A/B several detector strategies before committing?	✅	—

Preview is the “test drive.” For production, save the picked candidate as a real scraper.