Crawl a Site

RogerIQ can crawl public websites and create knowledge articles from the pages it indexes.

Start a Crawl

```json
{ "url": "https://example.com/docs", "maxPages": 100, "maxDepth": 3 }
```

Defaults and Limits

| Option   | Default | Limit |
| -------- | ------- | ----- |
| maxPages | 100     | 500   |
| maxDepth | 3       | 5     |
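
One way to apply these defaults and limits on the client side is sketched below. `normalize_crawl_options` is an illustrative helper, not part of RogerIQ's API, and clamping (rather than rejecting) out-of-range values is an assumption:

```python
# Documented defaults and hard limits for crawl options.
DEFAULTS = {"maxPages": 100, "maxDepth": 3}
LIMITS = {"maxPages": 500, "maxDepth": 5}

def normalize_crawl_options(options):
    """Fill in missing options with defaults and clamp values to the limits.

    Clamping is an assumption for this sketch; the service may instead
    reject out-of-range values.
    """
    normalized = {}
    for key, default in DEFAULTS.items():
        value = options.get(key, default)
        normalized[key] = min(value, LIMITS[key])
    return normalized
```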

Crawl Output

Each indexed page can become a knowledge article with crawl metadata such as:

  • render method
  • content type
  • crawl duration
  • content length
  • crawled timestamp
  • source URL
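
A metadata record covering the fields above might look like the sketch below. The field names and values are illustrative assumptions, not RogerIQ's documented schema:

```python
# Illustrative crawl metadata for one indexed page (field names assumed).
article_metadata = {
    "sourceUrl": "https://example.com/docs/getting-started",
    "renderMethod": "static",          # how the page was rendered for crawling
    "contentType": "text/html",
    "contentLength": 48210,            # bytes of extracted content
    "crawlDurationMs": 1350,
    "crawledAt": "2024-01-01T00:00:00Z",
}
```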

Recrawling

Use recrawl when the source page has changed and the corresponding RogerIQ article should be refreshed.
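
One simple way to detect that a source page has changed is to compare a stored content hash; this is a sketch, and RogerIQ's actual change detection is not specified in this document:

```python
import hashlib

def needs_recrawl(stored_hash, current_page_bytes):
    """Return True when the page content no longer matches the stored hash.

    `stored_hash` would be saved when the page was first crawled; comparing
    hashes is one illustrative change check among several possible.
    """
    return hashlib.sha256(current_page_bytes).hexdigest() != stored_hash
```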

When Crawls Fail

Common causes:

  • page blocks bots
  • page requires authentication
  • content renders only after client-side behavior the crawler does not support (for example, interaction-driven JavaScript)
  • page is too large
  • link graph exceeds max depth
  • rate limits or transient network errors

For product docs you control, HolyDocs sync is usually cleaner than generic crawling because it sends structured page content and stable external IDs.
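
A structured sync record with a stable external ID might look like the sketch below. The field names are assumptions, not HolyDocs' documented schema; the point is that a stable `externalId` lets repeated syncs update the same article rather than create duplicates:

```python
def build_sync_payload(external_id, title, body_markdown, source_url):
    """Sketch of a structured sync record (field names assumed).

    Reusing the same external_id on every sync updates the existing
    article instead of creating a new one.
    """
    return {
        "externalId": external_id,
        "title": title,
        "body": body_markdown,
        "sourceUrl": source_url,
    }
```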
