Web Niraj

NodeJS: Scraping Websites Using Request and Cheerio

I’ve recently been using NodeJS to build website scrapers quickly, usually in less than 100 lines of code. This tutorial shows how you can create your own scraper using two NodeJS modules: request and cheerio. The same code can easily be adapted to perform more complex tasks, such as completing and submitting a form.

Node Modules

Assuming you already have NodeJS installed, we need two additional modules: request and cheerio.

You can install them by using the npm (Node Package Manager) command-line tool:

See the gist on github.
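If the embedded gist doesn’t render, the install command is the standard npm invocation:

```shell
# Install both modules into the current project
npm install request cheerio
```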

The Scraper

Once installed, we’re ready to create our scraper. In this tutorial, we use the request module to fetch a webpage, and cheerio to select the elements we need. Since cheerio implements a subset of core jQuery, we can use familiar jQuery selectors to extract information from the page we just scraped.

See the gist on github.

On lines 6-10, we set the default settings we want to use. Since most pages will set cookies, it’s a good idea to keep jar set to true. This means that any cookies the page sets will be passed along with subsequent requests.

On line 14, we set the URL we want to scrape. Lines 15-17 show an example of custom headers we can pass. Here, you can set anything from authentication headers to content-type, cookies and user-agent.

On lines 19-27, we load the content of the page into cheerio and then select the elements we’re interested in. If it works correctly, the script will return your IP address, hostname and user-agent. Here’s an example:

POST Requests

The above is an example of a GET request, but POST requests are possible too. You can make a POST request by calling req.post and passing the form data via the form option. Example:

See the gist on github.
