These blog entries are meant to showcase some of the capabilities of @get-set-fetch/scraper.
In no particular order, each post implements a scraping scenario I’ve encountered while browsing relevant topics on Stack Overflow, Reddit, and similar sites. Some of them are not that interesting in themselves, but they do illustrate core scraper concepts. Source code for all posts is available in the GitHub repo under /examples.
Detailed steps for running the dataset projects are available under github.com/get-set-fetch/scraper/datasets.
Unless otherwise specified, each project provisions one central PostgreSQL instance and 20 scraper instances deployed in the DigitalOcean Frankfurt (FRA1) datacenter.
Modify the default Node.js TLS cipher list.
Shuffle it so that the scraper can no longer be identified as a Node.js app during the TLS handshake.
In-memory scrape queue. Store scraped content in a database.
Explore how multiple storage options can be combined within a single scrape project.
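A conceptual sketch of the split, not the library's actual API: the URL queue lives in memory (fast, disposable), while scraped content goes to a persistent store. A `Map` stands in here for a real database such as PostgreSQL.

```javascript
// Hybrid storage: in-memory queue + pluggable persistent content store.
class HybridScrapeProject {
  constructor(contentStore) {
    this.queue = [];                  // in-memory scrape queue
    this.seen = new Set();            // avoid re-queueing the same URL
    this.contentStore = contentStore; // persistent store (Map here, a DB in practice)
  }

  enqueue(url) {
    if (!this.seen.has(url)) {
      this.seen.add(url);
      this.queue.push(url);
    }
  }

  // scrapeFn: async url => content
  async drain(scrapeFn) {
    while (this.queue.length > 0) {
      const url = this.queue.shift();
      const content = await scrapeFn(url);
      // with PostgreSQL this would be an INSERT instead of Map#set
      await this.contentStore.set(url, content);
    }
  }
}
```

Losing the process loses only the queue, not the content already scraped, which is the trade-off this setup accepts.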
Extract console content from a site.
Extend the built-in Puppeteer client to record console events on each newly opened page.
Follow and extract all HTML links from a site.
Override the default plugins and use a custom exporter to generate the sitemap based on the scraped URLs.
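The exporter itself reduces to a small function, sketched below independently of the plugin API: take the list of scraped URLs and emit a sitemaps.org `urlset` document, escaping the XML special characters that can appear in query strings.

```javascript
// Turn a list of scraped URLs into a sitemap.xml string (urlset format).
function exportSitemap(urls) {
  const esc = s => s
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');

  const entries = urls
    .map(url => `  <url>\n    <loc>${esc(url)}</loc>\n  </url>`)
    .join('\n');

  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    entries,
    '</urlset>',
  ].join('\n');
}
```

A real exporter plugin would stream this to disk instead of returning a string, but the transformation is the same.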