Follow and extract all HTML links from a site.
Override the default plugins and use a custom exporter to generate the sitemap based on the scraped URLs.
A sitemap is just an XML file with a parent <urlset> tag containing <url> entries. At a minimum, each <url> tag contains a <loc> entry corresponding to the page URL. We’re going to generate the getsetfetch.org sitemap so that it looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>https://www.getsetfetch.org/index.html</loc></url>
....
</urlset>
See the full script and the examples section on how to launch it.
Scrape Configuration
Starting from an initial URL, we need to parse, extract, and visit all identified URLs from the same domain. We don’t need a browser for this; a DOM client like cheerio will do.
"client": {
"name": "cheerio"
}
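As an illustration only (this snippet is not part of the project), here’s how a DOM client such as cheerio reads links out of static HTML; the HTML fragment below is made up for the example:
import * as cheerio from 'cheerio';

// made-up HTML fragment
const html = `
  <a href="/blog/index.html">Blog</a>
  <a href="https://external.example.com/page">External</a>
`;
const $ = cheerio.load(html);

// print every href found on anchor tags
$('a').each((_, el) => {
  console.log($(el).attr('href'));
});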
In @get-set-fetch/scraper each URL represents a resource. Depending on the selected pipeline, a series of plugins is executed against each resource.
ExtractUrlsPlugin is responsible for parsing HTML resources and extracting new URLs to be scraped. By default it only extracts URLs from HTML links ending in .html, due to urlSelector’s default value of a[href$=".html"].
This obviously is not enough; HTML documents may have extensions other than .html, or no extension at all. Follow all links using the generic selector a. This will also include external links outside the sitemap domain. We need to filter these out by extending the default plugin class and overriding the isValidUrl method, so that only URLs from the getsetfetch.org domain are extracted. ExtractUrlsPlugin can run in both browser and Node.js environments, depending on the domRead and domWrite flags. Disable these inherited flags for the custom plugin, since it’s not going to run in a browser.
import { ExtractUrlsPlugin, PluginOpts } from '@get-set-fetch/scraper';

export default class ExtractSameHostUrlsPlugin extends ExtractUrlsPlugin {
  constructor(opts: Partial<PluginOpts> = {}) {
    super(opts);

    // the plugin runs in nodejs, not inside a browser page
    this.opts.domRead = false;
  }

  // only keep URLs belonging to the target domain
  isValidUrl(url: URL) {
    return url.hostname === 'www.getsetfetch.org';
  }
}
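As a quick sanity check (a standalone sketch, not part of the scrape run), the filtering logic can be exercised directly with Node’s built-in URL class:
import ExtractSameHostUrlsPlugin from './ExtractSameHostUrlsPlugin';

const plugin = new ExtractSameHostUrlsPlugin();

// same-host URL is kept, external URL is dropped
console.log(plugin.isValidUrl(new URL('https://www.getsetfetch.org/blog/index.html'))); // true
console.log(plugin.isValidUrl(new URL('https://example.com/page.html'))); // false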
Since we’re not interested in extracting content, we can replace the default ExtractHtmlContentPlugin with a plugin that never triggers, always returning false from the test method.
import { Plugin } from '@get-set-fetch/scraper';

export default class SkipExtractHtmlContentPlugin extends Plugin {
  // never triggers, so no content is extracted
  test() {
    return false;
  }

  apply() {}
}
Reference the custom plugins in the project options:
"project": {
"name": "sitemap",
"resources": [
{
"url": "https://www.getsetfetch.org/index.html"
}
],
"pipeline": "dom-static-content",
"pluginOpts": [
{
"name": "ExtractSameHostUrlsPlugin",
"path": "ExtractSameHostUrlsPlugin.ts",
"replace": "ExtractUrlsPlugin"
},
{
"name": "SkipExtractHtmlContentPlugin",
"path": "SkipExtractHtmlContentPlugin.ts",
"replace": "ExtractHtmlContentPlugin"
}
]
}
No rush :). Go easy on the site you’re scraping. You can always stop and resume the scraping process, since results are stored in a database. At session level, restrict the maximum number of parallel requests to 1, with a 3-second delay between them. This transforms the scraping process from parallel to sequential. See the concurrency options for more details. Monitor progress by checking the generated scrape.log file.
"concurrency": {
"session": {
"maxRequests": 1,
"delay": 3000
}
}
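As a rough back-of-the-envelope check, with one request at a time the crawl duration grows linearly with the number of pages; the page count below is hypothetical:
// hypothetical numbers, for illustration only
const delayMs = 3000; // matches the session delay above
const pageCount = 200; // made-up page count

// lower bound only: response times are not included
console.log(`minimum crawl duration: ~${(pageCount * delayMs) / 1000 / 60} minutes`); // ~10 minutes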
Here’s the full scrape configuration.
Custom Sitemap Exporter
Now that the scrape configuration is done, we can focus on the custom exporter responsible for generating the sitemap based on the scraped resources. At resource level, getResourceQuery defines which database columns to retrieve, together with some filtering criteria. The sitemap is only generated for HTML resources having contentType set to text/html. An fs.WriteStream is used to write the sitemap XML file to disk.
import fs from 'fs';
import { Exporter, Resource, getLogger } from '@get-set-fetch/scraper';

export default class SitemapExporter extends Exporter {
  logger = getLogger('SitemapExporter');

  wstream: fs.WriteStream;

  // only retrieve the url column, and only for html resources
  getResourceQuery() {
    return { cols: [ 'url' ], where: { contentType: 'text/html' } };
  }

  // open the write stream and write the sitemap header
  async preParse() {
    this.wstream = fs.createWriteStream(this.opts.filepath);
    this.wstream.write('<?xml version="1.0" encoding="UTF-8"?>\n');
    this.wstream.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n');
  }

  // invoked for each resource matching getResourceQuery
  async parse(resource: Partial<Resource>) {
    this.wstream.write(`<url><loc>${resource.url}</loc></url>\n`);
  }

  // close the urlset tag and the write stream
  async postParse() {
    this.wstream.write('</urlset>');
    this.wstream.close();
  }
}
Start Scraping
Putting it all together:
import { Scraper, ScrapeEvent, Project } from '@get-set-fetch/scraper';
import ScrapeConfig from './scrape-config.json';
import SitemapExporter from './SitemapExporter';

const scraper = new Scraper(ScrapeConfig.storage, ScrapeConfig.client);

// once the project is fully scraped, export the stored URLs as sitemap.xml
scraper.on(ScrapeEvent.ProjectScraped, async (project: Project) => {
  const exporter = new SitemapExporter({ filepath: 'sitemap.xml' });
  await exporter.export(project);
});

scraper.scrape(ScrapeConfig.project, ScrapeConfig.concurrency);
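Once the ProjectScraped handler has run, an optional way to verify the output is to count the <loc> entries written to sitemap.xml:
import fs from 'fs';

// standalone check, run after the scrape has completed
const sitemap = fs.readFileSync('sitemap.xml', 'utf8');
const urlCount = (sitemap.match(/<loc>/g) || []).length;
console.log(`sitemap.xml contains ${urlCount} URLs`);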