Export Scraped Content to CSV, Zip

Scraped content is stored in the database as resource entries belonging to a project. See Storage for more info. The base exporter constructor takes an options object with the following fields:

filepath - Location to store the content, absolute or relative to the current working directory.
pageLimit - Number of resources retrieved per bulk read. Defaults to 100.

All resources under a given project can be exported from an exporter instance.
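
To illustrate what pageLimit controls, the sketch below reads a collection in pages of at most pageLimit items. It is a stand-in rather than the library's code: readInPages and the in-memory resources array are hypothetical, whereas a real exporter reads each page from the database.

```javascript
// Hypothetical helper, not part of @get-set-fetch/scraper.
// Yields successive pages of at most `pageLimit` items.
// An in-memory array stands in for the database.
function* readInPages(resources, pageLimit = 100) {
  for (let offset = 0; offset < resources.length; offset += pageLimit) {
    yield resources.slice(offset, offset + pageLimit);
  }
}

// 250 fake resources read with the default page size of 100.
const resources = Array.from({ length: 250 }, (_, i) => ({ url: `https://example.com/${i}` }));
const pageSizes = [...readInPages(resources)].map(page => page.length);
console.log(pageSizes); // [ 100, 100, 50 ]
```

A smaller pageLimit keeps fewer resources in memory at once, at the cost of more reads.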

CSV Exporter

Exports scraped content as CSV. In addition to the base options, it accepts:

fieldSeparator
- default: ','
lineSeparator
- default: '\n'

const { CsvExporter } = require('@get-set-fetch/scraper');
const exporter = new CsvExporter({ filepath: 'file.csv', fieldSeparator: ',' });
await exporter.export(project);
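
The two separators determine how rows are assembled: fields are joined with fieldSeparator and rows with lineSeparator. A minimal sketch of that idea follows. The toCsv and escapeField helpers are hypothetical, and the quoting rule (RFC 4180 style double quotes) is an assumption, not necessarily what CsvExporter does internally.

```javascript
// Hypothetical helpers, not part of @get-set-fetch/scraper.
// Quote a field when it contains the separator, a double quote or a line break
// (assumed RFC 4180 style quoting).
function escapeField(value, fieldSeparator) {
  const text = String(value);
  const needsQuotes = text.includes(fieldSeparator) || /["\r\n]/.test(text);
  return needsQuotes ? `"${text.replace(/"/g, '""')}"` : text;
}

// Join fields with fieldSeparator and rows with lineSeparator.
function toCsv(rows, { fieldSeparator = ',', lineSeparator = '\n' } = {}) {
  return rows
    .map(row => row.map(field => escapeField(field, fieldSeparator)).join(fieldSeparator))
    .join(lineSeparator);
}

const csv = toCsv([
  ['url', 'content'],
  ['https://example.com/a', 'Hello, world'],
]);
console.log(csv);
```

The second row's content contains the field separator, so the sketch wraps it in double quotes to keep the column count intact.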


ZIP Exporter

Exports binary resources as a series of ZIP archives, one per bulk read; pageLimit caps the number of resources in each archive.

const { ZipExporter } = require('@get-set-fetch/scraper');
const exporter = new ZipExporter({ filepath: 'archive.zip' });
await exporter.export(project);
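
Since each bulk read yields one archive, the number of archives follows from the number of binary resources and pageLimit. The arithmetic can be sketched as below; archiveCount is a hypothetical helper, and it assumes every bulk read except possibly the last returns exactly pageLimit resources.

```javascript
// Hypothetical helper, not part of @get-set-fetch/scraper.
// Number of archives if each bulk read of up to `pageLimit`
// binary resources becomes one archive.
function archiveCount(binaryResourceCount, pageLimit = 100) {
  return Math.ceil(binaryResourceCount / pageLimit);
}

console.log(archiveCount(250));     // 3
console.log(archiveCount(250, 50)); // 5
```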


Custom Exporter

A custom exporter extends the base Exporter class and implements the following methods:

getResourceQuery
- Returns a subset of the following:
  - cols:
    - Database columns (resource attributes) to retrieve.
    - Default: ['url', 'content'].
  - where:
    - Filter resources by attribute value. Ex: where: {projectId: 'someId'}.
  - whereNotNull:
    - Filter resources by non-null attributes. Ex: whereNotNull: ['data']. ZipExporter uses this condition to archive only resources with binary data stored in the data attribute.
preParse
- Global actions to be done before parsing the resources, like opening files/streams.
parse
- Invoked once per resource with resource and resourceIdx as input parameters. The output for each resource is generated in this method.
postParse
- Global actions to be done after parsing the resources, like closing files/streams.
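
The way these methods fit together can be sketched with a small driver. This is an assumed call order for illustration only: runExport and toyExporter are hypothetical, an in-memory array replaces the database, and resourceIdx is assumed to start at 0.

```javascript
// Hypothetical driver, not the library's base class.
// Applies the query, then calls preParse, parse for each resource, postParse.
async function runExport(exporter, allResources) {
  const { cols = ['url', 'content'], whereNotNull = [] } = exporter.getResourceQuery();

  // Keep resources whose whereNotNull attributes are set,
  // then project each one onto the requested columns.
  const selected = allResources
    .filter(resource => whereNotNull.every(attr => resource[attr] != null))
    .map(resource => Object.fromEntries(cols.map(col => [col, resource[col]])));

  await exporter.preParse();
  for (let resourceIdx = 0; resourceIdx < selected.length; resourceIdx += 1) {
    await exporter.parse(selected[resourceIdx], resourceIdx);
  }
  await exporter.postParse();
}

// A toy exporter that records calls in memory instead of writing a file.
const calls = [];
const toyExporter = {
  getResourceQuery: () => ({ cols: ['url'], whereNotNull: ['data'] }),
  preParse: async () => { calls.push('preParse'); },
  parse: async (resource, resourceIdx) => { calls.push(`parse ${resourceIdx} ${resource.url}`); },
  postParse: async () => { calls.push('postParse'); },
};

const done = runExport(toyExporter, [
  { url: 'https://example.com/a.png', data: 'binary-a' },
  { url: 'https://example.com/page', data: null },
  { url: 'https://example.com/b.png', data: 'binary-b' },
]).then(() => console.log(calls));
```

The resource with a null data attribute is filtered out by whereNotNull, so parse runs only for the two binary resources, between a single preParse and a single postParse.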

Here's an example that writes each resource URL to a file.

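import * as fs from 'fs';
import { Exporter } from '@get-set-fetch/scraper';

export default class CustomExporter extends Exporter {
  // Stream for the output file, opened in preParse and closed in postParse.
  wstream: fs.WriteStream;

  // Only the url column is needed.
  getResourceQuery() {
    return { cols: [ 'url' ] };
  }

  // Open the output file before any resource is parsed.
  async preParse() {
    this.wstream = fs.createWriteStream(this.opts.filepath);
  }

  // Write one "<index>-<url>" line per resource.
  async parse(resource, resourceIdx) {
    this.wstream.write(`${resourceIdx}-${resource.url}\n`);
  }

  // Close the file once all resources have been written.
  async postParse() {
    this.wstream.close();
  }
}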
