At minimum a plugin needs to define two functions: test and apply. The former checks whether the plugin should be invoked, the latter invokes it. Both functions can be executed in either Node.js or browser environments.
A plugin is executed in the browser if it defines a domRead or domWrite option set to true. When registering such a plugin via PluginStore.addEntry, a bundle is created containing all its dependencies, including node_modules ones. Try to keep such dependencies in ECMAScript module format, as this enables tree shaking and keeps the bundle size to a minimum. Other module formats like CommonJS are non-deterministic at build time, severely limiting tree-shaking capabilities: the entire module may be added to the plugin bundle even though you’re only using part of it. Importing TypeScript plugins via the command line is not supported.
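For contrast with the browser-based example that follows, here is a minimal sketch of a plugin running entirely in Node.js: it defines no domRead or domWrite option, so no browser bundle is created when registering it. The plugin name is made up for this example, and the assumption that apply receives the same (project, resource) arguments as test is illustrative rather than taken verbatim from the library docs.
// Illustrative sketch of a Node-only plugin: no domRead/domWrite, so it runs in Node.js.
// Assumes apply receives the same (project, resource) arguments as test.
export default class TrimContentPlugin {
  test(project, resource) {
    // act only on resources that already have extracted content rows
    return Boolean(resource && resource.content);
  }

  apply(project, resource) {
    // post-process previously extracted rows, e.g. trim whitespace from each entry
    const content = resource.content.map(row => row.map(entry => entry.trim()));
    return { content };
  }
}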
import { Readability } from '@mozilla/readability';

export default class ReadabilityPlugin {
  // domRead: true marks this plugin for execution inside the browser
  opts = {
    domRead: true,
  }

  // invoke the plugin only on already loaded html resources
  test(project, resource) {
    if (!resource) return false;
    return (/html/i).test(resource.contentType);
  }

  // runs in the browser; extracts the page excerpt via Readability
  apply() {
    const article = new Readability(document).parse();
    return { content: [ [ article.excerpt ] ] };
  }
}
The above plugin checks whether a web resource is already loaded and is of html type. If these test conditions are met, it extracts a page excerpt using the @mozilla/readability library. It runs in the browser since its domRead option is set to true. content is a predefined property at Resource level with a string[][] type. Think of it as data rows, with each row containing one or multiple entries. When extracting excerpts from a web page, there is only one row and it contains a single excerpt element.
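To illustrate the row/entry structure, a hypothetical browser plugin could return one content row per extracted heading; the plugin name and selector below are made up for this example.
// Hypothetical example: one content row per h2 heading found on the page.
export default class HeadingsPlugin {
  opts = {
    domRead: true,
  }

  test(project, resource) {
    if (!resource) return false;
    return (/html/i).test(resource.contentType);
  }

  apply() {
    // each heading becomes its own row with a single entry
    const rows = Array.from(document.querySelectorAll('h2'))
      .map(heading => [ heading.innerText ]);
    return { content: rows };
  }
}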
Prior to scraping, the plugin needs to be registered.
await PluginStore.init();
await PluginStore.addEntry(join(__dirname, 'ReadabilityPlugin.js'));
With the help of this plugin one can extract article excerpts from news sites, such as the BBC technology section. The custom ReadabilityPlugin replaces the builtin ExtractHtmlContentPlugin. Only links whose hrefs start with /news/technology- are followed. Scraping is limited to 5 articles. See the full script.
const projectOpts = {
  name: 'bbcNews',
  pipeline: 'browser-static-content',
  pluginOpts: [
    {
      // follow only links one level deep whose hrefs start with /news/technology-
      name: 'ExtractUrlsPlugin',
      maxDepth: 1,
      selectorPairs: [
        { urlSelector: "a[href ^= '/news/technology-']" },
      ],
    },
    {
      // the custom plugin replaces the builtin ExtractHtmlContentPlugin
      name: 'ReadabilityPlugin',
      replace: 'ExtractHtmlContentPlugin',
      domRead: true,
    },
    {
      // limit scraping to 5 articles
      name: 'InsertResourcesPlugin',
      maxResources: 5,
    },
  ],
  resources: [
    {
      url: 'https://www.bbc.com/news/technology',
    },
  ],
};
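The following is only a rough sketch of how projectOpts might then be passed to a scrape run; the Scraper constructor arguments and the scrape call are assumptions rather than verified API, so refer to the linked full script for the exact wiring.
// Sketch only: constructor arguments and the scrape call are assumptions,
// not a verbatim copy of the library API; see the full script for the real wiring.
import { Scraper } from '@get-set-fetch/scraper';

// storageConnection and browserClient are placeholders for a database connection
// and a browser client created elsewhere in the full script.
const scraper = new Scraper(storageConnection, browserClient);
await scraper.scrape(projectOpts);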