What Actually Counts as a Page — and Why It Matters
The crawl engine now only follows links found in actual HTML pages, not in JavaScript or other assets. Scans may show a lower page count — but the results are more accurate, and unverified websites get more of their scan used on real pages.
Crawl Engine Update: Smarter, More Accurate Results
We have shipped a series of engine improvements focused on one goal: making your crawl results reflect what is actually on your website, not what the crawler mistakenly assumed was there.
Breaking change: fewer pages, better results
The most visible change in this update is that scans may report a lower page count than before — sometimes significantly lower. This is intentional and a good thing.
Previously, the crawler followed a broad rule: if a server responded with a text-based content type, it would parse that response for links. In practice, this caused JavaScript files to be treated as pages. The crawler would dig through JavaScript source code and extract anything that looked like a URL — template strings, JSON data, code comments — and add them to the crawl queue. The result was a mix of real pages and noise: asset paths, placeholder URLs, and routes that exist only inside a script.
Starting with this update, the crawler only parses HTML and XHTML responses for links. This is the same rule a browser applies: only rendered pages contain discoverable links. Non-HTML resources like JavaScript, CSS, and plain text files are fetched and checked for availability, but they are no longer treated as sources of new URLs to visit.
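The rule can be sketched as a simple content-type gate (a hypothetical helper, not the engine's actual code): only the media type decides whether a response is parsed for links, and parameters such as charset are ignored.

```python
# Only HTML/XHTML responses are parsed for new links; every other
# content type is fetched for availability checking only.
HTML_CONTENT_TYPES = {"text/html", "application/xhtml+xml"}

def should_extract_links(content_type_header: str) -> bool:
    """Return True only for HTML/XHTML responses.

    The media type is compared without its parameters, so
    'text/html; charset=utf-8' still qualifies.
    """
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    return media_type in HTML_CONTENT_TYPES
```

Note that under the old "any text-based content type" rule, a response like `application/javascript` or `text/css` would have been parsed for links; under this gate it is not.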
What this means for unverified websites
Unverified websites have a limit on how many pages can be crawled per scan. With the old engine, JavaScript files and other assets were silently consuming that limit. A site with a large JavaScript bundle could burn through a significant portion of the available pages before any real HTML pages were even discovered. Now that only HTML pages are used as link sources, the crawl goes further and returns results that are actually useful.
More false positive fixes under the hood
Beyond the content-type change, several other sources of incorrect URLs have been addressed:
Links using pseudo-protocols — javascript:, mailto:, tel:, data: — are now filtered out at extraction time rather than reaching the crawl queue and failing there. Meta-refresh redirects with quoted URL values are now parsed correctly.
URL normalisation has also been improved: the engine now handles international domain names (IDN/Punycode), lowercases hostnames, strips URL fragments to prevent the same page from being queued twice, and rejects malformed URLs early. Previously, URLs containing template placeholders like ?s={search_term_string} — common in WordPress OpenSearch definitions — would pass initial validation but then cause errors during the crawl. These are now rejected before they reach that stage.
UI improvements
The results table now shows a tooltip with the full URL when you hover over a truncated link. On mobile, long URLs are truncated cleanly without causing layout issues in the statistics view.
The scan limit indicator has been updated to distinguish between two different limits: the crawled pages limit and the total requests limit. If a scan stops early, you can now see which of the two was reached.
Response time is no longer shown for URLs that were discovered but never fetched — for example, external links that were found but not crawled. The column now correctly shows a dash for those entries instead of a misleading zero.