Blog Posts

These blog entries showcase some of the @get-set-fetch/scraper capabilities.

In no particular order, each post implements a scraping scenario I've encountered while browsing relevant topics on Stack Overflow, Reddit and similar sites. Some are not that interesting in themselves but do illustrate core scraper concepts. Source code for all posts is available in the GitHub repo under /examples.

Cloud Scraping with Terraform and Ansible - Running Existing Projects

Detailed steps for running the dataset projects available under github.com/get-set-fetch/scraper/datasets.
Unless otherwise specified, each project provisions one central PostgreSQL instance and 20 scraper instances deployed in the DigitalOcean Frankfurt (FRA1) datacenter.
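
Each scraper instance reaches that central database through the scraper's Knex-style storage options. A minimal sketch, assuming the host and credentials are injected by the provisioning step; all values below are placeholders:

```ts
// Knex-style storage options used by each scraper instance to reach
// the central PostgreSQL instance; host and credentials are placeholders
// normally filled in by the Terraform/Ansible provisioning step.
const storageOpts = {
  client: 'pg',
  connection: {
    host: process.env.PG_HOST, // IP of the central PostgreSQL droplet
    port: 5432,
    user: process.env.PG_USER,
    password: process.env.PG_PASSWORD,
    database: 'scraper',
  },
};

export default storageOpts;
```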

Randomize the TLS fingerprint

Modify the default Node.js TLS cipher list.
Shuffle it so that the scraper can no longer be identified as a Node.js app during the TLS handshake.
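
The gist, as a short sketch using Node's built-in tls module (tls.DEFAULT_CIPHERS is documented in recent Node.js versions): keep the leading TLS 1.3 suites in their default order and shuffle the remaining ones, so the cipher list sent in the ClientHello no longer matches the stock Node.js fingerprint.

```ts
import tls from 'node:tls';

// Fisher-Yates shuffle, so every permutation is equally likely.
function shuffle<T>(arr: T[]): T[] {
  for (let i = arr.length - 1; i > 0; i -= 1) {
    const j = Math.floor(Math.random() * (i + 1));
    [arr[i], arr[j]] = [arr[j], arr[i]];
  }
  return arr;
}

const ciphers = tls.DEFAULT_CIPHERS.split(':');

// Keep the first three (TLS 1.3) suites in place and shuffle the rest;
// the resulting list no longer matches the default Node.js cipher order.
tls.DEFAULT_CIPHERS = [
  ...ciphers.slice(0, 3),
  ...shuffle(ciphers.slice(3)),
].join(':');
```

This has to run before the scraper opens any connections, since the cipher list is read when a socket's secure context is created.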

Combine database with in-memory storage

Keep the scrape queue in memory while storing scraped content in a database.
Explore the ways you can combine multiple storage options within a single scrape project.
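
As a hypothetical sketch of the idea: pass one storage configuration for the queue and another for scraped resources. The per-collection keys below (queue, resource) are assumptions for illustration; the post and the /examples folder show the actual option shape.

```ts
import { Scraper } from '@get-set-fetch/scraper';

/*
 * Hypothetical storage split: the scrape queue lives in memory for speed,
 * while scraped content is persisted in PostgreSQL. The "queue" / "resource"
 * keys are illustrative, not the library's documented option names.
 */
const storageOpts = {
  queue: { client: 'memory' },
  resource: {
    client: 'pg',
    connection: {
      host: 'localhost',
      user: 'scraper',
      password: process.env.PG_PASSWORD,
      database: 'scraper',
    },
  },
};

const clientOpts = { name: 'puppeteer' };

// Scraper(storageOpts, clientOpts) follows the constructor shown in the README.
const scraper = new Scraper(storageOpts, clientOpts);
```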

Check for browser console errors

Extract console content from a site.
Extend the built-in Puppeteer client to record console events on each newly opened page.
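
The page-level part relies on stable Puppeteer API: subscribe to each page's console event and collect the messages. A minimal sketch; wiring the helper into the extended client is what the post covers, and the helper name is mine:

```ts
import { Page, ConsoleMessage } from 'puppeteer';

export interface ConsoleEntry {
  type: string; // 'log', 'error', 'warning', ...
  text: string;
}

// Attach a listener to a newly opened page; every console event
// emitted while the page is scraped ends up in the returned array.
export function recordConsoleEvents(page: Page): ConsoleEntry[] {
  const entries: ConsoleEntry[] = [];
  page.on('console', (msg: ConsoleMessage) => {
    entries.push({ type: msg.type(), text: msg.text() });
  });
  return entries;
}
```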

Generate a sitemap

Follow and extract all HTML links from a site.
Override the default plugins and use a custom exporter to generate a sitemap from the scraped URLs.
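
At its core the custom exporter just serializes the scraped URLs into sitemap XML. A minimal sketch of that step in plain TypeScript, independent of the scraper's exporter base class:

```ts
// Escape the five XML special characters so URLs can't break the markup.
function escapeXml(value: string): string {
  return value
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/'/g, '&apos;')
    .replace(/"/g, '&quot;');
}

// Build a minimal sitemap.xml body from a list of scraped URLs.
export function buildSitemap(urls: string[]): string {
  const entries = urls
    .map(url => `  <url><loc>${escapeXml(url)}</loc></url>`)
    .join('\n');

  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    entries,
    '</urlset>',
  ].join('\n');
}
```

For example, buildSitemap(['https://example.com/']) yields a valid urlset document ready to be written out as sitemap.xml.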