Get-Set, Fetch! Nodejs Web Scraper

@get-set-fetch/scraper is a plugin-based, batteries-included, open source Node.js web scraper. It scrapes, stores and exports data. An ordered list of plugins (default or custom defined) is executed against each web resource to be scraped.

Supports multiple storage options: SQLite, MySQL, PostgreSQL. Supports multiple browser or DOM-like clients: Puppeteer, Playwright, Cheerio, JSDom.

You can use it in your own JavaScript/TypeScript code, from the command line, with Docker, or in the cloud.

Install the scraper

$ npm install @get-set-fetch/scraper

Install a storage solution

$ npm install knex @vscode/sqlite3

Supported storage options are defined as peer dependencies; you need to install at least one of them. Currently available: SQLite, MySQL, PostgreSQL. All of them require the Knex.js query builder to be installed as well.

Install a browser client

$ npm install puppeteer

Browser and DOM clients are also defined as peer dependencies; install at least one of them. Supported browser clients: Puppeteer, Playwright. Supported DOM clients: Cheerio, JSDom.

Init storage

const { KnexConnection } = require('@get-set-fetch/scraper');
// in-memory SQLite database; replace filename with a file path to persist data between runs
const connConfig = {
  client: 'sqlite3',
  useNullAsDefault: true,
  connection: {
    filename: ':memory:'
  }
};
const conn = new KnexConnection(connConfig);

See Storage for the full configuration options supported by SQLite, MySQL, PostgreSQL.
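
A PostgreSQL connection uses the same KnexConnection wrapper with Knex's pg client options; in the sketch below the driver is assumed to be installed (knex plus the pg package) and the host/user/password/database values are placeholders.

const { KnexConnection } = require('@get-set-fetch/scraper');

const pgConnConfig = {
  client: 'pg',
  connection: {
    host: 'localhost',
    port: 5432,
    user: 'gsf-user',          // placeholder credentials
    password: 'gsf-password',
    database: 'gsf-db'
  }
};
const pgConn = new KnexConnection(pgConnConfig);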

Init browser client

const { PuppeteerClient } = require('@get-set-fetch/scraper');
const launchOpts = {
  headless: true,
};
const client = new PuppeteerClient(launchOpts);
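
launchOpts are handed to the browser client and, assuming they are forwarded to Puppeteer's launch call as the name suggests, any Puppeteer launch option can be set. A sketch with Chromium flags commonly needed when running headless inside Docker:

const { PuppeteerClient } = require('@get-set-fetch/scraper');

const dockerLaunchOpts = {
  headless: true,
  // standard Chromium flags, useful in restricted container environments
  args: ['--no-sandbox', '--disable-dev-shm-usage']
};
const dockerClient = new PuppeteerClient(dockerLaunchOpts);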

Init scraper

const { Scraper } = require('@get-set-fetch/scraper');
const scraper = new Scraper(conn, client);

Define a project

const projectOpts = {
  name: "myScrapeProject",
  pipeline: 'browser-static-content',
  pluginOpts: [
    {
      name: 'ExtractUrlsPlugin',
      maxDepth: 3,
      selectorPairs: [
        {
          urlSelector: '#searchResults ~ .pagination > a.ChoosePage:nth-child(2)',
        },
        {
          urlSelector: 'h3.booktitle a.results',
        },
        {
          urlSelector: 'a.coverLook > img.cover',
        },
      ],
    },
    {
      name: 'ExtractHtmlContentPlugin',
      selectorPairs: [
        {
          contentSelector: 'h1.work-title',
          label: 'title',
        },
        {
          contentSelector: 'h2.edition-byline a',
          label: 'author',
        },
        {
          contentSelector: 'ul.readers-stats > li.avg-ratings > span[itemProp="ratingValue"]',
          label: 'rating value',
        },
        {
          contentSelector: 'ul.readers-stats > li > span[itemProp="reviewCount"]',
          label: 'review count',
        },
      ],
    },
  ],
  resources: [
    {
      url: 'https://openlibrary.org/authors/OL34221A/Isaac_Asimov?page=1'
    }
  ]
};

You can define a project in multiple ways; the above example is the most direct one. You define one or more starting URLs, a predefined pipeline containing a series of scrape plugins with default options, and any plugin options you want to override. See pipelines and plugins for all available options.

ExtractUrlsPlugin.maxDepth defines the maximum depth of resources to be scraped. The starting resource has depth 0, resources discovered from it have depth 1, and so on. A value of -1 disables this check.

ExtractUrlsPlugin.selectorPairs defines CSS selectors for discovering new resources. The urlSelector property selects the links, while the optional titleSelector can be used for renaming binary resources such as images or PDFs. In order, the defined selectorPairs extract pagination URLs, book detail URLs, and book cover URLs.
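
As an illustration, the override below removes the depth limit and uses the optional titleSelector; it assumes the text content of the matched title element is used to name downloaded binary resources such as the cover images.

const extractUrlsOpts = {
  name: 'ExtractUrlsPlugin',
  maxDepth: -1, // -1 disables the depth check
  selectorPairs: [
    {
      urlSelector: 'a.coverLook > img.cover',
      titleSelector: 'h1.work-title', // assumed: names the downloaded cover image
    },
  ],
};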

ExtractHtmlContentPlugin.selectorPairs scrapes content via CSS selectors. Optional labels can be used to specify column names when exporting results as CSV.

Define concurrency options

const concurrencyOpts = {
  project: {
    delay: 1000
  },
  domain: {
    delay: 5000
  }
};

A minimum delay of 5000 ms will be enforced between scraping consecutive resources from the same domain. At project level, across all domains, any two resources will be scraped with a minimum 1000 ms delay between requests. See concurrency options for all available options.

Start scraping

scraper.scrape(projectOpts, concurrencyOpts);

The entire process is asynchronous. Listen to the emitted scrape events to monitor progress.
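
A minimal progress listener, assuming a ResourceScraped event invoked with the current project and resource (check the scrape events documentation for the exact event names and arguments):

const { ScrapeEvent } = require('@get-set-fetch/scraper');

scraper.on(ScrapeEvent.ResourceScraped, (project, resource) => {
  // resource.url is assumed to hold the just-scraped URL
  console.log(`scraped ${resource.url}`);
});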

Export results

const { ScrapeEvent, CsvExporter, ZipExporter } = require('@get-set-fetch/scraper');

scraper.on(ScrapeEvent.ProjectScraped, async (project) => {
  const csvExporter = new CsvExporter({ filepath: 'books.csv' });
  await csvExporter.export(project);

  const zipExporter = new ZipExporter({ filepath: 'book-covers.zip' });
  await zipExporter.export(project);
});

Wait for scraping to complete by listening to the ProjectScraped event.

Export the scraped HTML content as CSV and the scraped images as a zip archive. See Export for all supported parameters.
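
Errors can be monitored in a similar fashion; the sketch below assumes a ProjectError event receiving the project and the error (event name and callback signature are assumptions, see the scrape events documentation):

scraper.on(ScrapeEvent.ProjectError, (project, err) => {
  console.error(`scraping failed for project ${project.name}`, err);
});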