A sitemap index is an XML file that points to several sitemap files.

Large websites often split their URLs across many sitemap files, because the sitemap protocol caps each file at 50,000 URLs and 50 MB uncompressed. The sitemap index acts as a table of contents for those files.

Example:

<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/post-sitemap.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://example.com/product-sitemap.xml</loc>
  </sitemap>
</sitemapindex>

Sitemap vs sitemap index

A sitemap lists page URLs.

A sitemap index lists sitemap files.

This matters because a basic scraper that reads only one sitemap file can miss thousands of URLs. A scraper that follows the sitemap index reads every linked sitemap file and collects their page URLs too.
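
To make the difference concrete, here is a minimal Python sketch of a scraper that follows sitemap indexes. It tells the two file types apart by the root element (sitemapindex or urlset) and recurses into each linked sitemap. The function names and the fetch callback are assumptions made for illustration, not part of any particular tool, and gzip-compressed sitemaps are not handled.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List, Optional, Set


def local_name(tag: str) -> str:
    """Strip the XML namespace: '{ns}urlset' -> 'urlset'."""
    return tag.rsplit("}", 1)[-1]


def collect_page_urls(
    url: str,
    fetch: Callable[[str], str],
    seen: Optional[Set[str]] = None,
) -> List[str]:
    """Return page URLs reachable from a sitemap or sitemap index.

    `fetch` is a hypothetical callback that returns the XML text for a URL.
    `seen` guards against a sitemap index that links back to itself.
    """
    seen = set() if seen is None else seen
    if url in seen:
        return []
    seen.add(url)

    root = ET.fromstring(fetch(url))
    kind = local_name(root.tag)

    # Read the <loc> that is a direct child of each <sitemap> or <url> entry.
    locs = []
    for entry in root:
        for child in entry:
            if local_name(child.tag) == "loc" and child.text:
                locs.append(child.text.strip())
                break

    if kind == "sitemapindex":
        pages: List[str] = []
        for loc in locs:
            pages.extend(collect_page_urls(loc, fetch, seen))
        return pages
    if kind == "urlset":
        return locs
    raise ValueError(f"Unexpected root element: {kind}")


if __name__ == "__main__":
    # In-memory stand-in for a website, so the sketch runs without network access.
    NS = 'xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'
    fake_site = {
        "https://example.com/sitemap_index.xml": (
            f"<sitemapindex {NS}>"
            "<sitemap><loc>https://example.com/post-sitemap.xml</loc></sitemap>"
            "</sitemapindex>"
        ),
        "https://example.com/post-sitemap.xml": (
            f"<urlset {NS}>"
            "<url><loc>https://example.com/blog/hello/</loc></url>"
            "</urlset>"
        ),
    }
    print(collect_page_urls("https://example.com/sitemap_index.xml", fake_site.__getitem__))
```

Passing the fetch step in as a callback keeps the parsing logic testable. In real use it could wrap an HTTP client of your choice.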

📌 Short version

If a website is large, start from the sitemap index. It usually gives better coverage than guessing individual sitemap files.

Where to find sitemap indexes

Try these URLs:

  • https://example.com/sitemap.xml
  • https://example.com/sitemap_index.xml
  • https://example.com/robots.txt

The robots.txt file often lists sitemap locations on lines that start with "Sitemap:", and it can list more than one.
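
As a rough illustration, the Python sketch below pulls sitemap locations out of the text of a robots.txt file. The function name is hypothetical, and the sketch assumes the file content has already been downloaded. It matches the field name case-insensitively and treats everything after a "#" as a comment.

```python
from typing import List


def sitemap_urls_from_robots(robots_txt: str) -> List[str]:
    """Return the URLs declared on 'Sitemap:' lines, in file order."""
    urls = []
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        # Split only on the first colon so the URL's own "https:" stays intact.
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


if __name__ == "__main__":
    sample = (
        "User-agent: *\n"
        "Disallow: /admin/\n"
        "Sitemap: https://example.com/sitemap_index.xml\n"
    )
    print(sitemap_urls_from_robots(sample))
```

Each URL returned this way can be a plain sitemap or a sitemap index, so it is worth checking the root element before assuming which one you have.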

Datablist sitemap index workflow

The Sitemap Scraper can follow sitemap indexes and import URLs from the linked sitemap files.

After import, use regex filters to keep sections such as /blog/, /products/, /docs/, or /case-studies/. Then run metadata extraction, AI web scraping, or export the URL list to CSV.
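
If you want to try the same kind of filter outside Datablist, the Python sketch below applies one regular expression to a list of URLs and writes the matches as CSV. It is a standalone illustration, not Datablist's implementation. The pattern, function names, and single "url" column are assumptions.

```python
import csv
import io
import re
from typing import Iterable, List, Pattern

# Keep URLs whose path starts with one of the four example sections.
SECTION_PATTERN = re.compile(r"^https?://[^/]+/(blog|products|docs|case-studies)/")


def keep_sections(urls: Iterable[str], pattern: Pattern[str] = SECTION_PATTERN) -> List[str]:
    """Return only the URLs that match the section pattern, in input order."""
    return [u for u in urls if pattern.search(u)]


def to_csv(urls: Iterable[str]) -> str:
    """Serialize URLs as CSV text with a single 'url' header column."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["url"])
    writer.writerows([u] for u in urls)
    return buf.getvalue()


if __name__ == "__main__":
    imported = [
        "https://example.com/blog/sitemap-basics/",
        "https://example.com/about/",
        "https://example.com/products/widget/",
    ]
    print(to_csv(keep_sections(imported)))
```

Anchoring the pattern to the start of the path avoids matching a section name that only appears deeper in a URL, such as /about/blog/.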