Each pipeline contains a series of plugins with predefined values for all plugin options. A project configuration extends a pipeline by adding new plugins, replacing existing ones, or overriding the predefined plugin options.
Take a look at the Examples for real-world project configurations.
Base Pipeline: Static Content
Used to scrape static content that does not rely on JavaScript to read or alter the HTML. Comes in two variants: browser-static-content runs in a browser, while dom-static-content uses a DOM-like parsing library such as cheerio.
Pipeline: Browser-Static-Content
Sequentially executed plugins.
- BrowserFetchPlugin
  - gotoOptions.timeout: 30000
  - gotoOptions.waitUntil: domcontentloaded
  - stabilityCheck: 0
  - stabilityTimeout: 0
- ExtractUrlsPlugin
  - domRead: true
  - maxDepth: -1
  - selectorPairs: [ { urlSelector: 'a[href$=".html"]' } ]
- ExtractHtmlContentPlugin
  - domRead: true
  - selectorPairs: []
- InsertResourcesPlugin
  - maxResources: -1
- UpsertResourcePlugin
  - keepHtmlData: false
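The predefined options above can be overridden per project through `pluginOpts`. A minimal sketch (assuming an already-instantiated `scraper`, a placeholder start URL, and that nested `gotoOptions` are merged as shown) that raises the fetch timeout and waits for the full page load event:

```js
scraper.scrape({
  name: 'slowPageScraping',
  pipeline: 'browser-static-content',
  pluginOpts: [
    {
      // override the predefined BrowserFetchPlugin defaults
      name: 'BrowserFetchPlugin',
      gotoOptions: {
        timeout: 60000,    // 30000 -> 60000 ms for slow pages
        waitUntil: 'load', // wait for the full load event, not just domcontentloaded
      },
    }
  ],
  resources: [
    {
      url: 'startUrl'
    }
  ]
})
```

Options not listed in the override (such as stabilityCheck) keep their predefined values.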
Pipeline: Dom-Static-Content
Sequentially executed plugins.
- NodeFetchPlugin
  - headers: { 'Accept-Encoding': 'br,gzip,deflate' }
- ExtractUrlsPlugin
  - domRead: false
  - maxDepth: -1
  - selectorPairs: [ { urlSelector: 'a[href$=".html"]' } ]
- ExtractHtmlContentPlugin
  - domRead: false
  - selectorPairs: []
- InsertResourcesPlugin
  - maxResources: -1
- UpsertResourcePlugin
  - keepHtmlData: false
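As with the browser variant, predefined options can be overridden. A sketch (assuming an instantiated `scraper`, a placeholder start URL, and a hypothetical User-Agent value; note the override replaces the predefined `headers` option, so `Accept-Encoding` is repeated) that sends a custom request header:

```js
scraper.scrape({
  name: 'customHeaderScraping',
  pipeline: 'dom-static-content',
  pluginOpts: [
    {
      name: 'NodeFetchPlugin',
      headers: {
        'Accept-Encoding': 'br,gzip,deflate', // keep the predefined value
        'User-Agent': 'my-scraper/1.0',       // hypothetical custom header
      },
    }
  ],
  resources: [
    {
      url: 'startUrl'
    }
  ]
})
```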
Static Content Examples
Limit scraping to a single page by setting `ExtractUrlsPlugin.maxDepth` to `0`.
```js
scraper.scrape({
  name: 'singlePageScraping',
  pipeline: 'browser-static-content',
  pluginOpts: [
    {
      name: 'ExtractUrlsPlugin',
      maxDepth: 0,
    }
  ],
  resources: [
    {
      url: 'startUrl'
    }
  ]
})
```
From each HTML page, scrape all elements matching the `h1.title` CSS selector.
```js
scraper.scrape({
  name: 'h1TitleScraping',
  pipeline: 'browser-static-content',
  pluginOpts: [
    {
      name: 'ExtractHtmlContentPlugin',
      selectorPairs: [
        {
          contentSelector: 'h1.title',
          label: 'main title',
        },
      ]
    }
  ],
  resources: [
    {
      url: 'startUrl'
    }
  ]
})
```
Add a new `ScrollPlugin` to the pipeline and scroll HTML pages to reveal additional dynamically loaded content.
```js
scraper.scrape({
  name: 'scrollScraping',
  pipeline: 'browser-static-content',
  pluginOpts: [
    {
      name: 'ScrollPlugin',
      after: 'UpsertResourcePlugin',
      stabilityCheck: 1000,
    }
  ],
  resources: [
    {
      url: 'startUrl'
    }
  ]
})
```
Replace `ExtractHtmlContentPlugin` with an external plugin. `path` is relative to the current working directory when invoked from a module, or to the config file directory when invoked from the CLI.
```js
scraper.scrape({
  name: 'customScraping',
  pipeline: 'browser-static-content',
  pluginOpts: [
    {
      name: 'H1CounterPlugin',
      path: '../plugins/h1-counter-plugin.js',
      replace: 'ExtractHtmlContentPlugin',
      customOptionA: 1000,
      customOptionB: 'h1',
    }
  ],
  resources: [
    {
      url: 'startUrl'
    }
  ]
})
```