Docker Scraping

Build

All scraper images are based on the alpine:3.14 Docker image. You have to build the images locally; they’re not published on Docker Hub. For both docker build and docker run commands, make the docker repo directory the current working directory.
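
A minimal setup sketch, before building: clone the repository and switch to its docker directory. The GitHub URL and directory layout are assumptions inferred from the mount paths used in the run examples below.

git clone https://github.com/get-set-fetch/scraper.git
cd scraper/docker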

A set of build-time variables allows you to customize the Docker image.

  • BROWSER_CLIENT - puppeteer
  • DOM_CLIENT - cheerio, jsdom
  • STORAGE - sqlite, pg, mysql
  • VERSION - source
  • USER_ID
  • GROUP_ID

The BROWSER_CLIENT and DOM_CLIENT variables are mutually exclusive: you either scrape using a headless browser or an HTML/DOM parser library.

USER_ID and GROUP_ID are used to add the gsfuser user to the container. This non-root user runs the scraper and reads and writes data under the /home/gsfuser/scraper/data container path mounted from the host. Use --build-arg USER_ID=$(id -u) --build-arg GROUP_ID=$(id -g) to provide the same uid/gid as the currently logged-in user. If you’re on Windows you can ignore these two variables.
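
If the mounted data directory already exists on the host with different ownership (for example after a run as root), a plain chown restores write access for the container user. This is only a sketch; <host_dir> stands for the same host path used in the run examples below.

# re-own the host data directory with the current user's uid/gid
sudo chown -R $(id -u):$(id -g) <host_dir>/scraper/docker/data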

Create an image using cheerio, sqlite and the latest source code.

docker build \
--tag getsetfetch \
--build-arg DOM_CLIENT=cheerio \
--build-arg STORAGE=sqlite \
--build-arg VERSION=source \
--build-arg USER_ID=$(id -u) \
--build-arg GROUP_ID=$(id -g) .

Create an image using puppeteer, sqlite and the latest source code.

docker build \
--tag getsetfetch \
--build-arg BROWSER_CLIENT=puppeteer \
--build-arg STORAGE=sqlite \
--build-arg VERSION=source \
--build-arg USER_ID=$(id -u) \
--build-arg GROUP_ID=$(id -g) .
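
Optionally confirm the image is available locally before running it.

# list locally built getsetfetch images
docker image ls getsetfetch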

Run

All examples mount the /home/gsfuser/scraper/data container path from the host, giving you easy access to the config, log, sqlite and csv files, including the logs and the exported scraped content. The arguments after the image name are forwarded to the scraper command line interface. All referenced files are available in the docker repo directory.
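
As a quick smoke test, you can invoke the image with a single argument; the assumption here is that --version on its own just prints the scraper version and exits.

docker run --rm getsetfetch:latest --version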

SQLite and Cheerio

Log, scrape and export data using sqlite as storage and cheerio as the DOM client.

config-sqlite-cheerio.json

{
  "storage": {
    "client": "sqlite3",
    "useNullAsDefault": true,
    "connection": {
      "filename": "gsf.sqlite"
    },
    "debug": false
  },
  "client": {
    "name": "cheerio"
  },
  "project": {
    "name": "myProj",
    "pipeline": "dom-static-content",
    "pluginOpts": [
      {
        "name": "ExtractHtmlContentPlugin",
        "selectorPairs": [
          {
            "contentSelector": "h3"
          }
        ]
      },
      {
        "name": "InsertResourcesPlugin",
        "maxResources": 1
      }
    ],
    "resources": [
      {
        "url": "https://www.getsetfetch.org/node/docker.html"
      }
    ]
  }
}
docker run \
-v <host_dir>/scraper/docker/data:/home/gsfuser/scraper/data getsetfetch:latest \
--version \
--config data/config-sqlite-cheerio.json \
--save \
--overwrite \
--scrape \
--loglevel info \
--logdestination data/scrape.log \
--export data/export.csv
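
Since data/ is bind-mounted, the log and the exported csv produced by the command above can be inspected directly on the host.

# view the scrape log and the exported content
tail <host_dir>/scraper/docker/data/scrape.log
cat <host_dir>/scraper/docker/data/export.csv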

SQLite and Puppeteer

Log, scrape and export data using sqlite as storage and puppeteer as the browser client.
Use either --security-opt seccomp=unconfined or --security-opt seccomp=data/chromium-security-profile.json (see the source blog post) to allow the Chromium syscalls.

config-sqlite-puppeteer.json

{
  "storage": {
    "client": "sqlite3",
    "useNullAsDefault": true,
    "connection": {
      "filename": "gsf.sqlite"
    },
    "debug": false
  },
  "client": {
    "name": "puppeteer",
    "opts": {
      "ignoreHTTPSErrors": true,
      "args": [
        "--ignore-certificate-errors",
        "--no-first-run",
        "--single-process"
      ]
    }
  },
  "project": {
    "name": "myProj",
    "pipeline": "browser-static-content",
    "pluginOpts": [
      {
        "name": "ExtractHtmlContentPlugin",
        "selectorPairs": [
          {
            "contentSelector": "h3"
          }
        ]
      },
      {
        "name": "InsertResourcesPlugin",
        "maxResources": 1
      }
    ],
    "resources": [
      {
        "url": "https://www.getsetfetch.org/node/getting-started.html"
      }
    ]
  }
}
docker run \
--security-opt seccomp=unconfined \
-v <host_dir>/scraper/docker/data:/home/gsfuser/scraper/data getsetfetch:latest \
--version \
--config data/config-sqlite-puppeteer.json \
--save \
--overwrite \
--scrape \
--loglevel info \
--logdestination data/scrape.log \
--export data/export.csv
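
The same run using the dedicated Chromium seccomp profile instead of unconfined; this assumes chromium-security-profile.json is present in the docker repo data directory, since the profile path is resolved on the host.

docker run \
--security-opt seccomp=data/chromium-security-profile.json \
-v <host_dir>/scraper/docker/data:/home/gsfuser/scraper/data getsetfetch:latest \
--version \
--config data/config-sqlite-puppeteer.json \
--save \
--overwrite \
--scrape \
--loglevel info \
--logdestination data/scrape.log \
--export data/export.csv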

PostgreSQL and Puppeteer

Log, scrape and export data using postgresql as storage and puppeteer as the browser client.
This starts the scraper as a docker-compose service.
Remember to first build the corresponding Docker image with --build-arg STORAGE=pg --build-arg BROWSER_CLIENT=puppeteer, as shown below.
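
The build command mirrors the earlier puppeteer example, with pg as storage.

docker build \
--tag getsetfetch \
--build-arg BROWSER_CLIENT=puppeteer \
--build-arg STORAGE=pg \
--build-arg VERSION=source \
--build-arg USER_ID=$(id -u) \
--build-arg GROUP_ID=$(id -g) .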

config-pg-puppeteer.json

{
  "storage": {
    "client": "pg",
    "useNullAsDefault": true,
    "connection": {
      "host": "pg",
      "port": "5432",
      "user": "gsf-user",
      "password": "gsf-pswd",
      "database": "gsf-db"
    },
    "debug": false
  },
 "client": {
    "name": "puppeteer",
    "opts": {
      "ignoreHTTPSErrors": true,
      "args": [
        "--ignore-certificate-errors",
        "--no-first-run",
        "--single-process"
      ]
    }
  },
  "project": {
    "name": "myProj",
    "pipeline": "browser-static-content",
    "pluginOpts": [
      {
        "name": "ExtractHtmlContentPlugin",
        "selectorPairs": [
          {
            "contentSelector": "h3",
            "label": "headline"
          }
        ]
      },
      {
        "name": "InsertResourcesPlugin",
        "maxResources": 1
      },
      {
        "name": "UpsertResourcePlugin",
        "keepHtmlData": true
      }
    ],
    "resources": [
      {
        "url": "https://www.getsetfetch.org/node/getting-started.html"
      }
    ]
  }
}

docker-compose.yml

version: "3.3"
services:
  pg:
    image: postgres:11-alpine
    environment:
      POSTGRES_USER: gsf-user
      POSTGRES_PASSWORD: gsf-pswd
      POSTGRES_DB: gsf-db

  gsf:
    image: getsetfetch:latest
    command: >
      --version
      --config data/config-pg-puppeteer.json
      --save
      --overwrite
      --scrape
      --loglevel info
      --logdestination data/scrape.log
      --export data/export.csv
    volumes:
      - ../data:/home/gsfuser/scraper/data
    security_opt:
      - seccomp:../data/chromium-security-profile.json
    depends_on:
      - pg

volumes:
  data:

# start
docker-compose up -d

# stop
docker-compose down
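
Once the services are up, a couple of optional checks; the service names and credentials are the ones defined in the compose file and config above.

# follow the scraper logs
docker-compose logs -f gsf

# list the tables created in PostgreSQL after the scrape finishes
docker-compose exec pg psql -U gsf-user -d gsf-db -c '\dt'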