Npm Module

Features


Web Crawler
  • download http and https resources
  • filter resources based on robots.txt user-agent and allow, disallow directives
  • define custom request headers: User-Agent, Authorization, Cookie, ...
  • react to response headers: either parse text/html data or store binary data
Web Scraper
  • builtin DOM access: jsdom
  • builtin link discovery: customizable regexp, only internal links are taken into consideration
  • builtin link following: all newly found URLs are subject for future crawling
Storage
  • persist crawled and to-be-crawled resources
  • supported sql engines: sqlite:memory, sqlite:file, mysql, postgresql
  • supported nosql engines: mongodb
Performance
  • bloom filter usage for filtering duplicate URLs
Highly Customizable
  • clear separation of concerns using an extendable plugin architecture centered on crawling and scraping events
  • clean, easy to follow async handling using javascript es6 async/await syntax

Getting Started
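
install the module, assuming the npm package name matches the require call below

npm install get-set-fetch --save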


import the get-set-fetch dependency

const GetSetFetch = require('get-set-fetch');

the entire code is async,
declare an async function in order to make use of await

async function simpleCrawl() {

init db connection, by default in memory sqlite

  const { Site } = await GetSetFetch.init();

load site if already present,
otherwise create it by specifying a name and the first url to crawl,
only links from this location down will be subject to further crawling

  let site = await Site.get('simpleSite');
  if (!site) {
    site = new Site(
      'simpleSite',
      'https://simpleSite/',
    );
    await site.save();
  }

keep crawling the site until there are no more resources to crawl

  await site.crawl();
}

start crawling

simpleCrawl();

Storage : Configurations


All SQL storage options are managed using the knex query builder. The MongoDB storage option is managed using the official mongodb driver.
Both storage options expose the same API for the Site and Resource entities but have separate implementations.

If no configuration is provided, the default storage option is in-memory SQLite.
In this case, use a plugin like PersistResourcePlugin if you want to access resources after crawling completes.
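
A minimal sketch of wiring one of the configurations below into the crawler; it assumes GetSetFetch.init accepts such a connection object as its argument, which is not spelled out in this section:

const GetSetFetch = require('get-set-fetch');

async function initWithFileStorage() {
  // assumption: init accepts a knex-style connection object like the ones listed below
  const conn = {
    client: 'sqlite3',
    useNullAsDefault: true,
    connection: { filename: './mydb.sqlite' },
  };

  const { Site } = await GetSetFetch.init(conn);
  return Site;
}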

In-Memory SQLite3
requires knex, sqlite3
npm install knex sqlite3 --save
{
    "info": "knex using in-memory sqlite3",
    "client": "sqlite3",
    "useNullAsDefault": true,
    "connection": {
        "filename": ":memory:"
    }
}
SQLite3
requires knex, sqlite3
npm install knex sqlite3 --save
{
    "info": "knex using file based sqlite3",
    "client": "sqlite3",
    "useNullAsDefault": true,
    "connection": {
        "filename": "./mydb.sqlite"
    }
}
MySQL
requires knex, the mysql driver and a MySQL server installation
npm install knex mysql --save
{
    "info": "knex using mysql",
    "client": "mysql",
    "connection": {
        "host" : "mysql_host",
        "user" : "mysql_user",
        "password" : "mysql_password",
        "database" : "mysql_db"
    }
}
PostgreSQL
requires knex, the pg driver and a PostgreSQL server installation
npm install knex pg --save
{
    "info": "knex using postgresql",
    "client": "pg",
    "connection": process.env.PG_CONNECTION_STRING,
}
MongoDB
requires the mongodb driver and a MongoDB server installation
npm install mongodb --save
{
    "info": "mongoDB",
    "url": "mongodb://db_host:db_port",
    "dbName": "get-set-fetch-test"
}

Storage : Resource Entity


Only resources not already crawled (crawledAt === null) are considered for crawling.
Breadth-first crawling order is not guaranteed.

Persisted properties
  • id (number | objectId): unique identifier, number for sql engines, objectId for mongodb
  • siteId (number | objectId): id of the site the resource belongs to
  • url (string): full URL of the resource
  • depth (number): number of link traversal operations required to reach the resource, starting from the site url
  • info (text): json stringified text generated by various plugins
  • crawledAt (date-time): time of the last crawl operation; a value of null indicates the resource hasn't been crawled yet
Transient properties
  • contentType (string): content type of the fetched resource
  • content (string | buffer): resource content, text or binary
  • document: jsdom document instance
  • urlsToAdd ([string]): newly discovered resources (from html anchors, images, ...) after parsing the current resource
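
For illustration only, a crawled html resource could look like the plain object below; the values are hypothetical and the transient properties exist only in memory while the resource is being crawled:

{
  // persisted properties
  id: 17,
  siteId: 3,
  url: 'https://siteA/path1/page.html',
  depth: 2,
  info: '{"title":"Page title"}',
  crawledAt: new Date('2019-01-01T10:00:00Z'),

  // transient properties, populated while the resource is being crawled
  contentType: 'text/html',
  content: '<html>...</html>',
  document: {/* jsdom document instance */},
  urlsToAdd: ['https://siteA/path1/img/logo.png'],
}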

Storage : Site Entity


Each crawl process starts by defining a site with a base url from where resources will be identified and then processed.
The site url is used to create the first site resource with depth 0.

Each site has its own set of plugins and plugin options, making the crawl process customizable at site level.
Plugins and corresponding options are persisted, so after you've defined a site, you only have to retrieve it to continue crawling.

During the SELECT phase a resource is selected for crawling.
During the SAVE phase the current resource is updated and the newly found resources are added.
Pairwise independent bloom filters are used to determine if a given resource has already been added.

You can customize the storage behavior by writing your own plugins for the SELECT and SAVE phases.

Persisted properties
  • id (number | objectId): unique identifier, number for sql engines, objectId for mongodb
  • name (string): unique name for the site to crawl, a site can be retrieved by name or id
  • url (string): crawling will start from this URL "downwards". Example: if the site url is http://siteA/path1/, resources under http://siteA/path1/ or http://siteA/path1/subpath2/ will be crawled, while higher resources like http://siteA/path2 will not. If you want to crawl an entire site and not just a section of it, have the url point to the site domain name.
  • robotsTxt (string): site robots.txt content
  • resourceFilter (buffer): bloom filter for detecting duplicate URLs when adding new resources to crawl
  • plugins (json): array of {name, opts} entries representing the site plugins; name is the plugin constructor name, opts the plugin options. Site.crawlResource will execute the plugins against the currently crawled resource.
Transient properties
No transient properties.

Plugins : Architecture


Each resource you crawl (html, image, pdf, ...) belongs to a site.
Each site has its own set of plugins and plugin options defining the crawling / scraping behavior for all resources linked to it.

A resource is crawled during a set of predefined phases:
  • SELECT (PRE_SELECT, POST_SELECT): based on the crawledAt field, a resource is selected to be crawled
  • FETCH (PRE_FETCH, POST_FETCH): based on the defined request headers, the resource content is retrieved
  • PROCESS (PRE_PROCESS, POST_PROCESS): based on content-type, extension, etc., the resource content is scraped
  • SAVE (PRE_SAVE, POST_SAVE): the resource is updated, newly found resources are inserted for future crawling
BasePlugin
Each plugin extends BasePlugin.

Each phase can have one or multiple plugins attached to it.
A plugin instance declares the phase it belongs to via pluginInstance.getPhase().

During each phase, all corresponding plugins are tested for execution via pluginInstance.test(resource).
Depending on the resource some plugins may return false and have their execution skipped.
For example, JsDomPlugin can create a jsdom instance from html content but not from binary content.

A plugin is invoked via pluginInstance.apply(site, resource).
Its return value is merged with the current resource.

The architecture is easily extensible.
Just add a new plugin to a site via site.use(pluginInstance).
If the site already contains an instance of the same plugin, the new instance will replace the old one.
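
As a sketch of the plugin contract described above, the plugin below extracts the page title during the PROCESS phase. The export path GetSetFetch.plugins.BasePlugin, the string form of the phase name and the pageTitle property are assumptions used for illustration only:

const GetSetFetch = require('get-set-fetch');

// assumption: BasePlugin is exposed under GetSetFetch.plugins
class TitleExtractPlugin extends GetSetFetch.plugins.BasePlugin {
  // the phase this plugin is attached to (assumed to be identified by name)
  getPhase() {
    return 'PROCESS';
  }

  // only run against resources that have a jsdom document
  test(resource) {
    return resource.document !== undefined && resource.document !== null;
  }

  // the returned properties are merged into the current resource;
  // pageTitle is a hypothetical transient property used for illustration
  apply(site, resource) {
    return { pageTitle: resource.document.title };
  }
}

async function customCrawl() {
  const { Site } = await GetSetFetch.init();

  let site = await Site.get('customSite');
  if (!site) {
    site = new Site('customSite', 'https://customSite/');
    // an already present instance of the same plugin would be replaced
    site.use(new TitleExtractPlugin());
    await site.save();
  }

  await site.crawl();
}

customCrawl();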

Plugins : Default Plugins


The default plugins provide the minimum functionality to crawl a site.
Starting from a given url, additional html pages are identified and crawled.

SELECT : SelectResourcePlugin
constructor parameters: none
functionality: selects a resource to crawl

FETCH : NodeFetchPlugin
constructor parameters:
reqHeaders: {
    type: object,
    default: {}
}
functionality: downloads a resource; reqHeaders can be any header combination like user-agent, basic-auth

PRE_PROCESS : JsDomPlugin
constructor parameters:
contentTypeRe: {
    type: regexp,
    default: /html/i
}
functionality: adds a document property (jsdom document instance) to the resource; only executed when the regexp matches the response content-type header

PROCESS : ExtractUrlPlugin
constructor parameters:
contentTypeRe: {
    type: regexp,
    default: /html/i
},
extensionRe: {
    type: regexp,
    default: /^(html|htm|php)$/i
},
allowNoExtension: {
    type: boolean,
    default: true
}
functionality: adds a urlsToAdd property (new resource urls extracted from the current document) to the resource; extraction is controlled by contentTypeRe, extensionRe and allowNoExtension

POST_PROCESS : RobotsFilterPlugin
constructor parameters:
content: {
    type: string
}
functionality: filters out new resources (urlsToAdd) based on robots.txt rules; if no content is provided, at first execution it will parse site.robotsTxt as content; supported directives: user-agent, allow, disallow

SAVE : InsertResourcePlugin
constructor parameters: none
functionality: saves the newly identified resources (urlsToAdd) to be crawled

SAVE : UpdateResourcePlugin
constructor parameters: none
functionality: updates the current resource; only persisted properties are saved, transient ones are not
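
As an example, the default NodeFetchPlugin instance can be swapped for one sending custom request headers via site.use; this assumes NodeFetchPlugin is exposed under GetSetFetch.plugins like the plugins used in the image scraper guide, and the header values are placeholders:

const GetSetFetch = require('get-set-fetch');

async function crawlWithCustomHeaders() {
  const { Site } = await GetSetFetch.init();

  let site = await Site.get('headerSite');
  if (!site) {
    site = new Site('headerSite', 'https://headerSite/');

    // replace the default NodeFetchPlugin instance with one carrying custom request headers
    // (the header values below are placeholders)
    site.use(new GetSetFetch.plugins.NodeFetchPlugin({
      reqHeaders: {
        'User-Agent': 'my-crawler/1.0',
        Authorization: 'Basic dXNlcjpwYXNz',
      },
    }));
    await site.save();
  }

  await site.crawl();
}

crawlWithCustomHeaders();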

Plugins : Optional Plugins


Additional plugins not included by default.

SAVE : PersistResourcePlugin
constructor parameters:
extensionRe: {
    type: regexp,
    default: /^(gif|png|jpg)$/i
},
target: {
    type: string,
    default: './tmp'
}
functionality: writes resources to disk; extensionRe filters which resources are written, target is the directory where resources are saved; if relative, it is resolved from the current working directory

Guides : Image Scraper


import the get-set-fetch dependency

const GetSetFetch = require('get-set-fetch');

the entire code is async for the image scraper,
declare an async function in order to make use of await

async function imgCrawl() {

init db connection, by default in memory sqlite

  const { Site } = await GetSetFetch.init();

load site if already present,
otherwise create it by specifying a name and the first url to crawl,
only links from this location down will be subject to further crawling

  let site = await Site.get('imgSite');
  if (!site) {
    site = new Site(
      'imgSite',
      'https://imgSite/index.html',
    );

by default ExtractUrlPlugin only extracts html resources,
override the default plugin instance with a new one containing suitable options

    site.use(new GetSetFetch.plugins.ExtractUrlPlugin({
      extensionRe: /^(html|jpg|png)$/i,
    }));

add a PersistResourcePlugin instance to the current site,
specify what extensions to save and where

    site.use(new GetSetFetch.plugins.PersistResourcePlugin({
      target: './myTargetDir',
      extensionRe: /^(jpg|png)$/i,
    }));
    await site.save();
  }

keep crawling the site until there are no more resources to crawl

  await site.crawl();

end async function

}

start crawling

imgCrawl();