I’ve recently been using NodeJS build website scrapers quickly, and usually in less than 100 lines of code. This tutorial shows how you can easily create your own scraper using two NodeJs modules:
cheerio. The same code can easily be adapted to perform complex tasks like completing and submitting a form.
Assuming you already have NodeJS installed, we require two additional modules to be installed:
- request – The simplest way possible to make
- cheerio – A fast, flexible, and lean implementation of core jQuery designed specifically for the server
You can install them by using the
npm (Node Package Manager) command-line tool:
Once installed, we’re ready to create our scraper. In my tutorial, we use the request module to fetch a webpage, and use cheerio to select the elements we need. Since cheerio is an implementation of jQuery, we can use the same selectors to select and extract information from the page we just scraped.
On lines 6-10, we set the default settings we want to use. Since most pages will drop cookies, it’s a good idea to keep
jar set to
true. This will mean that any cookies the page sets will be passed to subsequent requests.
On line 14, we set the URL we want to scrape. Lines 15-17 show an example of custom headers we can pass. Here, you can set anything from authentication headers to content-type, cookies and user-agent.
On lines 19-27, we load the content of the page into cheerio and then select the elements we’re interested in. If it works correctly, the script will return your IP address, host-name and user-agent. Here’s an example:
The above is an example of a
GET request, but
POST requests are possible too. You can make POST requests by calling
req.post and pass in form data using the
form variable in the POST function. Example: