Scraping Pipelines - Load Web Pages, Extract New URLs, Extract Content, Save to Database

Each pipeline contains a series of plugins with predefined values for all plugin options. A project configuration extends a pipeline by replacing plugins, adding new ones, or overriding the predefined plugin options.
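
To illustrate how a project configuration could be layered onto a pipeline, the sketch below merges a list of pluginOpts entries into a pipeline's defaults. The helper name mergePluginOpts and its merge rules are assumptions made for illustration, not the library's actual implementation. Only the replace and after keys are taken from the examples on this page.

```javascript
// Hypothetical helper: layers project-level pluginOpts onto a pipeline's defaults.
// The merge rules are assumptions for illustration, not the library's real logic.
function mergePluginOpts(defaults, overrides) {
  // Copy the defaults so the pipeline definition itself is never mutated.
  const merged = defaults.map((opts) => ({ ...opts }));

  for (const { replace, after, ...opts } of overrides) {
    if (replace) {
      // Swap an existing plugin for a new one at the same position.
      const idx = merged.findIndex((p) => p.name === replace);
      if (idx === -1) throw new Error(`Cannot replace missing plugin ${replace}`);
      merged[idx] = opts;
    } else if (after) {
      // Insert a new plugin right after an existing one.
      const idx = merged.findIndex((p) => p.name === after);
      if (idx === -1) throw new Error(`Cannot insert after missing plugin ${after}`);
      merged.splice(idx + 1, 0, opts);
    } else {
      // Override options of a plugin that is already part of the pipeline.
      const idx = merged.findIndex((p) => p.name === opts.name);
      if (idx === -1) throw new Error(`Unknown plugin ${opts.name}`);
      merged[idx] = { ...merged[idx], ...opts };
    }
  }
  return merged;
}

// Trimmed-down defaults, using option names from the pipelines listed on this page.
const defaults = [
  { name: 'ExtractUrlsPlugin', maxDepth: -1 },
  { name: 'UpsertResourcePlugin', keepHtmlData: false },
];
const merged = mergePluginOpts(defaults, [
  { name: 'ExtractUrlsPlugin', maxDepth: 0 },
  { name: 'ScrollPlugin', after: 'UpsertResourcePlugin', stabilityCheck: 1000 },
]);
console.log(merged.map((p) => p.name).join(' > '));
```

Copying the defaults before merging keeps the pipeline definition reusable across projects, which is the main design choice in this sketch.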

Take a look at Examples for real-world project configurations.

Base Pipeline: Static Content

Used to scrape static data. It does not rely on JavaScript to read or alter the HTML content.

Comes in two variants: browser-static-content and dom-static-content. The first runs in a browser; the second uses a DOM-like parsing library such as cheerio.

Pipeline: Browser-Static-Content

Sequentially executed plugins.

BrowserFetchPlugin
- gotoOptions.timeout: 30000
- gotoOptions.waitUntil: domcontentloaded
- stabilityCheck: 0
- stabilityTimeout: 0
ExtractUrlsPlugin
- domRead: true
- maxDepth: -1
- selectorPairs: [ { urlSelector: 'a[href$=".html"]' } ]
ExtractHtmlContentPlugin
- domRead: true
- selectorPairs: []
InsertResourcesPlugin
- maxResources: -1
UpsertResourcePlugin
- keepHtmlData: false
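
The default urlSelector above relies on the CSS `$=` attribute operator, which matches when the attribute value ends with the given string. A minimal sketch of the equivalent filter in plain JavaScript follows; the function name and the list of href values are made up for illustration.

```javascript
// Mimics a[href$=".html"]: keep only href values that end in ".html".
// The href list below is made up for illustration.
function matchesHtmlSuffix(href) {
  return href.endsWith('.html');
}

const hrefs = ['/docs/index.html', '/img/logo.png', '/about.html?lang=en', 'guide.html'];
const selected = hrefs.filter(matchesHtmlSuffix);
console.log(selected);
```

Because the match is made against the raw attribute value, a link such as /about.html?lang=en is skipped. A project that needs such links would have to override urlSelector.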

Pipeline: Dom-Static-Content

Sequentially executed plugins.

NodeFetchPlugin
- headers: { 'Accept-Encoding': 'br,gzip,deflate' }
ExtractUrlsPlugin
- domRead: false
- maxDepth: -1
- selectorPairs: [ { urlSelector: 'a[href$=".html"]' } ]
ExtractHtmlContentPlugin
- domRead: false
- selectorPairs: []
InsertResourcesPlugin
- maxResources: -1
UpsertResourcePlugin
- keepHtmlData: false

Static Content Examples

Limit scraping to a single page by setting ExtractUrlsPlugin.maxDepth to 0.

scraper.scrape({
  name: 'singlePageScraping',
  pipeline: 'browser-static-content',
  pluginOpts: [
    {
      name: 'ExtractUrlsPlugin',
      maxDepth: 0,
    }
  ],
  resources: [
    {
      url: 'startUrl'
    }
  ]
})
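
The effect of maxDepth can be sketched as a simple depth check. This is an illustration only: it assumes the start URL has depth 0 and that -1 means no limit, and the function name canExtractUrls is hypothetical rather than part of the library.

```javascript
// Hypothetical depth check; assumes the start URL has depth 0 and
// that maxDepth: -1 means "no limit".
function canExtractUrls(resourceDepth, maxDepth) {
  if (maxDepth === -1) return true;
  // URLs are only extracted from pages that sit above the depth limit.
  return resourceDepth < maxDepth;
}

// With maxDepth: 0 the start page itself yields no new URLs,
// so scraping stops after a single page.
console.log(canExtractUrls(0, 0));
console.log(canExtractUrls(0, -1));
```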

From each HTML page, scrape all elements matching the h1.title CSS selector.

scraper.scrape({
  name: 'h1TitleScraping',
  pipeline: 'browser-static-content',
  pluginOpts: [
    {
      name: 'ExtractHtmlContentPlugin',
      selectorPairs: [
        {
          contentSelector: 'h1.title',
          label: 'main title',
        },
      ]
    }
  ],
  resources: [
    {
      url: 'startUrl'
    }
  ]
})
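
As a rough idea of what a selector pair asks for, the sketch below collects the text of every element matching each contentSelector and groups it by label. The function name extractContent and the shape of the returned object are assumptions for illustration; the library's actual output format may differ. A stub stands in for the page so the sketch runs outside a browser.

```javascript
// Hypothetical extractor: collects text for every selector pair.
// `root` is anything exposing querySelectorAll, e.g. `document` in a browser.
function extractContent(root, selectorPairs) {
  const content = {};
  for (const { contentSelector, label } of selectorPairs) {
    const elements = Array.from(root.querySelectorAll(contentSelector));
    // Fall back to the selector itself when no label is given.
    content[label || contentSelector] = elements.map((el) => el.textContent.trim());
  }
  return content;
}

// Stub standing in for a parsed page; its contents are made up.
const fakePage = {
  querySelectorAll: (selector) =>
    selector === 'h1.title'
      ? [{ textContent: ' First title ' }, { textContent: 'Second title' }]
      : [],
};
const content = extractContent(fakePage, [
  { contentSelector: 'h1.title', label: 'main title' },
]);
console.log(content);
```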

Add a ScrollPlugin to the pipeline to scroll HTML pages and reveal further dynamically loaded content.

scraper.scrape({
  name: 'scrollScraping',
  pipeline: 'browser-static-content',
  pluginOpts: [
    {
      name: 'ScrollPlugin',
      after: 'UpsertResourcePlugin',
      stabilityCheck: 1000,
    }
  ],
  resources: [
    {
      url: 'startUrl'
    }
  ]
})

Replace ExtractHtmlContentPlugin with an external plugin. The path is resolved relative to the current working directory when invoked from a module, or relative to the config file directory when invoked from the CLI.

scraper.scrape({
  name: 'customScraping',
  pipeline: 'browser-static-content',
  pluginOpts: [
    {
      name: 'H1CounterPlugin',
      path: '../plugins/h1-counter-plugin.js',
      replace: 'ExtractHtmlContentPlugin',
      customOptionA: 1000,
      customOptionB: 'h1',
    }
  ],
  resources: [
    {
      url: 'startUrl'
    }
  ]
})
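
For orientation, a plugin file such as ../plugins/h1-counter-plugin.js might look roughly like the standalone sketch below. The test and apply method names, the way options are passed in, and the returned object are all assumptions for illustration and should be checked against the library's plugin documentation. A real plugin would also query the DOM rather than use a regular expression.

```javascript
// Hypothetical shape for ../plugins/h1-counter-plugin.js.
// The test/apply method names and the returned object are assumptions,
// not the documented plugin contract.
class H1CounterPlugin {
  constructor(opts = {}) {
    // customOptionB from the configuration above is treated as the tag to count.
    this.opts = { customOptionB: 'h1', ...opts };
  }

  // Decide whether this plugin should run for the given resource.
  test(resource) {
    return Boolean(resource && resource.html);
  }

  // Count opening tags with a naive regex; good enough for a sketch only.
  apply(resource) {
    const tag = this.opts.customOptionB;
    const matches = resource.html.match(new RegExp(`<${tag}[\\s>]`, 'gi')) || [];
    return { count: matches.length };
  }
}

// Made-up page content to exercise the plugin.
const plugin = new H1CounterPlugin({ customOptionA: 1000, customOptionB: 'h1' });
const page = { html: '<h1>One</h1><p>text</p><h1 class="title">Two</h1>' };
const result = plugin.test(page) ? plugin.apply(page) : null;
console.log(result);
```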
