Scraping Content To Database

Each scraped URL (web page, image, API endpoint, …) represents a Resource. Binary content is stored under resource.data, while text-based content is stored under resource.content. Resources that share the same project configuration and were discovered from the same initial URL(s) are grouped in a Project.

Projects are the starting point for any scraping operation. Each project has its own Queue and Resource collections, each with its own storage options. This means you can use one storage type for managing the queue and a different one for storing the scraped resources. URLs waiting to be scraped are first added to the Queue; successfully scraped content is saved as a Resource.
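
The queue-then-resource flow described above can be illustrated with a small in-memory model. This is a conceptual sketch only: InMemoryQueue, InMemoryResources, scrapeAll and fetchFn are hypothetical names invented for illustration and are not part of @get-set-fetch/scraper. It assumes a fetch function that returns nothing for a failed scrape.

```javascript
// Conceptual model only; none of these names come from @get-set-fetch/scraper.

// Holds to-be-scraped URLs, ignoring URLs that were already queued.
class InMemoryQueue {
  constructor() {
    this.pending = [];
    this.seen = new Set();
  }

  // Adds the URLs not seen before and returns how many were added.
  add(urls) {
    let added = 0;
    for (const url of urls) {
      if (!this.seen.has(url)) {
        this.seen.add(url);
        this.pending.push(url);
        added += 1;
      }
    }
    return added;
  }

  // Returns the next URL to scrape, or undefined when the queue is empty.
  next() {
    return this.pending.shift();
  }
}

// Holds successfully scraped resources, keyed by URL.
class InMemoryResources {
  constructor() {
    this.saved = new Map();
  }

  save(resource) {
    this.saved.set(resource.url, resource);
  }
}

// Drains the queue. fetchFn (hypothetical) returns { content, data, links }
// on success and a falsy value on failure.
function scrapeAll(project, fetchFn) {
  for (let url = project.queue.next(); url !== undefined; url = project.queue.next()) {
    const result = fetchFn(url);
    if (!result) continue; // failed scrape: nothing is saved

    // Text goes under content, binary under data, mirroring the Resource fields.
    project.resources.save({ url, content: result.content, data: result.data });
    project.queue.add(result.links || []);
  }
}
```

Because the queue and the resource store are separate objects, either one could in principle be backed by a different storage type, which is the separation the paragraph above describes.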

You can add support for additional storage types by extending the Connection abstract class and implementing a subset of the IProjectStorage, IQueueStorage and IResourceStorage interfaces.
The currently supported databases (SQLite, MySQL, PostgreSQL) are accessed through KnexConnection.

See the connection examples below for possible values of connConfig.

const { KnexConnection } = require('@get-set-fetch/scraper');
const conn = new KnexConnection(connConfig);
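
One way to avoid hard-coding connConfig is to assemble it from environment variables. The sketch below is not part of the library: buildConnConfig and the GSF_DB_* variable names are hypothetical, and the defaults simply mirror the example values used on this page.

```javascript
// Hypothetical helper; buildConnConfig and the GSF_DB_* variables are
// illustrative names, not part of @get-set-fetch/scraper.
function buildConnConfig(env = process.env) {
  const client = env.GSF_DB_CLIENT || 'sqlite3';
  const base = { client, useNullAsDefault: true, debug: false };

  // SQLite only needs a file name.
  if (client === 'sqlite3') {
    return { ...base, connection: { filename: env.GSF_DB_FILE || 'gsf.sqlite' } };
  }

  if (client !== 'mysql' && client !== 'pg') {
    throw new Error(`Unsupported client: ${client}`);
  }

  // MySQL and PostgreSQL share the same connection fields.
  return {
    ...base,
    connection: {
      host: env.GSF_DB_HOST || 'localhost',
      port: env.GSF_DB_PORT || (client === 'mysql' ? '33060' : '54320'),
      user: env.GSF_DB_USER,
      password: env.GSF_DB_PASSWORD,
      database: env.GSF_DB_NAME,
    },
  };
}

// Usage, following the snippet above:
// const { KnexConnection } = require('@get-set-fetch/scraper');
// const conn = new KnexConnection(buildConnConfig());
```

Keeping credentials in the environment rather than in a committed config file is a common choice; adapt the variable names and defaults to your own setup.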


SQLite

SQLite is the default storage option when none is provided, and it consumes the fewest resources. It requires knex and an SQLite driver.

$ npm install knex @vscode/sqlite3


SQLite connection example.

{
  "client": "sqlite3",
  "useNullAsDefault": true,
  "connection": {
    "filename": "gsf.sqlite"
  },
  "debug": false
}


MySQL

Requires knex and the mysql driver.

$ npm install knex mysql


MySQL connection example.

{
  "client": "mysql",
  "useNullAsDefault": true,
  "connection": {
    "host": "localhost",
    "port": "33060",
    "user": "gsf-user",
    "password": "gsf-pswd",
    "database": "gsf-db"
  },
  "debug": false
}


PostgreSQL

Requires knex and the pg (PostgreSQL) driver.

$ npm install knex pg


PostgreSQL connection example.

{
  "client": "pg",
  "useNullAsDefault": true,
  "connection": {
    "host": "localhost",
    "port": "54320",
    "user": "gsf-user",
    "password": "gsf-pswd",
    "database": "gsf-db"
  },
  "debug": false
}
