Cloud Scraping with Terraform and Ansible

The npm package contains a Terraform module and multiple Ansible roles for distributed cloud scraping using one PostgreSQL database and multiple scraper instances using dom-like clients such as Cheerio and Jsdom. Each web scraper consumes a queue of to-be-scraped URLs stored at database level, scrapes them concurrently and saves the content back.

Source files are available in the repository /cloud directory.
A real world example is available under @get-set-fetch/benchmarks/cloud.

Terraform Module

Available under node_modules/@get-set-fetch/scraper/cloud/terraform the module declares the following input variables:

public_key_file
- authorized key, part of a SSH key pair
- deployed and used on remote instances for SSH authentication
public_key_name
- name the public key is registered under
private_key_file
- identity key, part of a SSH key pair
- used locally for SSH authentication
ansible_inventory_file
- contains IP addresses of the created instances split into two groups: postgresql and scraper
- makes it possible to target either group via Ansible roles and/or playbooks
region
- region to deploy in
pg
- settings related to PostgreSQL instance
- name
  - instance name
- image
  - image id
  - default: ubuntu-20-04-x64
- size
  - instance type
  - default: s-4vcpu-8gb
- ansible_playbook_file
  - playbook to run after cloud-init
scraper
- settings related to scraper instances
- count
  - number of instances
  - default : 1
- name
  - instance name prefix, full names will be generated as “prefix-0”, “prefix-1”, …
- image
  - image id
  - default: ubuntu-20-04-x64
- size
  - instance type
  - default: s-1vcpu-1gb
- ansible_playbook_file
  - playbook to run after cloud-init

Supported providers:

DigitalOcean

Custom module example referencing the builtin one under node_modules.

module "cloud_scraping" {
  source = "../../node_modules/@get-set-fetch/scraper/cloud/terraform"

  region                 = "fra1"
  public_key_name        = "get-set-fetch"
  public_key_file        = var.public_key_file
  private_key_file       = var.private_key_file
  ansible_inventory_file = "../ansible/inventory/hosts.cfg"

  pg = {
    name                  = "pg"
    image                 = "ubuntu-20-04-x64"
    size                  = "s-4vcpu-8gb"
    ansible_playbook_file = "../ansible/pg-setup.yml"
  }

  scraper = {
    count                 = 1
    name                  = "scraper"
    image                 = "ubuntu-20-04-x64"
    size                  = "s-1vcpu-1gb"
    ansible_playbook_file = "../ansible/scraper-setup.yml"
  }
}

                Copied!
                
            

Ansible Roles

Available under node_modules/@get-set-fetch/scraper/cloud/ansible the provided Ansible roles can be either referenced from your own playbooks or via ansible cli. If invoked from the command line you also must provide the --private-key.

gsf-postgresql-logs

Retrieves systemd log messages since last boot and service status. Retrieves PostgreSQL logs.

Variables:

export_dir
- local export destination directory

ansible postgresql -u root -i <ansible_inventory_file> -m include_role -a "name=gsf-postgresql-logs" \
-e 'export_dir=<export_dir> \
--private-key <private_key_file> 

                Copied!

gsf-postgresql-setup

Creates a database and adds a user to it. Allows for remote connections ONLY from the virtual private cloud.
Tunes the server.

Variables:

db
- name
  - database to be created
- user
  - user granted ALL privileges on the above database
- password
  - corresponding user password
pg_config
- settings related to database tunning
- default values target a vm with 4 vCPU, 8 GB RAM using pgtune as base config
- each entry contains the setting key and its corresponding default value
- see PostgreSQL connection settings and resource consumptation for more details
  - max_connections: 100
  - shared_buffers: 2GB
  - effective_cache_size: 6GB
  - maintenance_work_mem: 512MB
  - checkpoint_completion_target: 0.9
  - wal_buffers: 16MB
  - default_statistics_target: 100
  - random_page_cost: 1.1
  - effective_io_concurrency: 200
  - work_mem: 10485kB
  - min_wal_size: 1GB
  - max_wal_size: 4GB
  - max_worker_processes: 4
  - max_parallel_workers_per_gather: 2
  - max_parallel_workers: 4
  - max_parallel_maintenance_workers: 2

gsf-scraper-export

Remotely invokes gsfscrape --export via command line. Downloads the generated csv file to export_file.

Variables:

work_dir
- remote directory containing a gsf-config.json config file.
export_file
- local csv file destination
log_level
- log level for the export action
log_destination
- remote log path relative to work_dir

gsf-scraper-logs

Downloads getsetfetch systemd service status, output and error logs. Downloads the scrape log.

Variables:

work_dir
- remote directory containing the scrape log
export_dir
- local destination directory
log_destination
- remote scrape log path relative to work_dir

ansible scraper -u root -i <ansible_inventory_file>  -m include_role -a "name=gsf-scraper-logs" \
-e 'export_dir=<export_dir> work_dir=<work_dir> log_destination=<log_destination> \
--private-key <private_key_file>

                Copied!

gsf-scraper-setup

Installs @get-set-fetch/scraper and dependencies. Creates a systemd service.
Uploads additional files like configuration, custom plugins, etc.

Variables:

db
- connection settings
- name
- user
- password
- pool
  - min
  - max
scraper
- uv_threadpool_size
  - number of threads used in libuv’s threadpool used by nodejs
  - relevant (from a scraping point of view) nodejs APIs using the threadpool: dns.lookup()
- npm_install
  - list of npm packages to be installed
  - either library@version or tarball archives
- log
  - destination
    - log filepath
  - level
    - minimum log level
- files
  - additional files to be copied on the remote vm
  - scrape_urls
    - csv file containing URLs to be scraped
  - gsf_config
    - json scrape configuration file
  - additional
    - list of any other files to be copied, like custom plugins referenced in gsf_config

Ansible playbook example.

roles:
  - role: gsf-scraper-setup
    vars:
      db:
        name: getsetfetch
        user: ""
        password: ""
        pool:
          min: 10
          max: 10
      scraper:
        uv_threadpool_size: 14 # 4 + 10
        npm_install:
          - [email protected]
          - [email protected]
          - [email protected]
          - "@get-set-fetch/[email protected]"
        log:
          level: error
        files:
            scrape_urls: '1000k_urls.csv'
            gsf_config: templates/gsf-config-saved-urls.json.j2
            additional:
                - PerfNodeFetchPlugin.js

Copied!

gsf-scraper-stats

Checks scraping progress. With each call adds a new row in a stats csv file.

Variables

project_name
- target scrape project
export_file
- local csv file destination
db_name
db_user
db_password

Example of stats content.
2xx - 5xx columns represent number of resources scraped with a corresponding http response status code.

time, 2xx, 3xx, 4xx, 5xx, in_progress, not_scraped, per_second, scraper_instances
1644588918, 74216, 0, 0, 0, 318, 925466, , 2
1644588940, 102469, 0, 0, 0, 345, 897186, 1284.2272727272727, 2
1644589144, 364334, 0, 0, 0, 272, 635394, 1283.6519607843138, 2
1644589365, 655711, 0, 0, 0, 78, 344211, 1318.447963800905, 2
1644589481, 814907, 0, 0, 0, 58, 185035, 1372.3793103448277, 2

                Copied!
                
            

ansible postgresql -u root -i <ansible_inventory_file> -m include_role -a "name=gsf-scraper-stats" \
-e 'project_name=<project_name> export_file=<export_file> db_name=<db_name> \
db_user=<db_user> db_password=<db_password>' \
--private-key <private_key>

                Copied!