Web Niraj

NodeJS: Scraping Websites Using Request and Cheerio

I’ve recently been using NodeJS to build website scrapers quickly, usually in less than 100 lines of code. This tutorial shows how you can create your own scraper using two NodeJS modules: request and cheerio. The same code can be adapted to perform more complex tasks like completing and submitting a form.

Node Modules

Assuming you already have NodeJS installed, you need to install two additional modules:

  • request – The simplest way possible to make http and https calls
  • cheerio – A fast, flexible, and lean implementation of core jQuery designed specifically for the server

You can install them by using the npm (Node Package Manager) command-line tool:

    npm install request cheerio

The Scraper

Once installed, we’re ready to create our scraper. In this tutorial, we use the request module to fetch a webpage, and use cheerio to select the elements we need. Since cheerio implements the core of jQuery, we can use the same selectors to select and extract information from the page we just fetched.

On lines 6-10, we set the default settings we want to use. Since most pages set cookies, it’s a good idea to keep jar set to true, so that any cookies the page sets are passed to subsequent requests.

On line 14, we set the URL we want to scrape. Lines 15-17 show an example of custom headers we can pass. Here, you can set anything from authentication headers to content-type, cookies and user-agent.

On lines 19-27, we load the content of the page into cheerio and then select the elements we’re interested in. If it works correctly, the script will print your IP address, hostname and user-agent. Here’s an example:

[Image: NodeJS-Scraping-Example]
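
The steps described above can be sketched roughly as follows. This is a minimal reconstruction under stated assumptions, not the original listing, so the line numbers mentioned in the text do not map onto it. The URL, the #ip, #host and #ua selectors, the User-Agent value and the helper names extractInfo and scrape are all hypothetical; request.defaults, the jar, url and headers options, and cheerio.load are the parts the tutorial relies on.

```javascript
// Assumption: the target page exposes the values in elements with these ids.
// Swap in selectors that match the page you are actually scraping.
const SELECTORS = { ip: '#ip', host: '#host', userAgent: '#ua' };

// Pull the fields out of a loaded cheerio document ($ works like jQuery).
function extractInfo($) {
  const info = {};
  for (const [key, selector] of Object.entries(SELECTORS)) {
    info[key] = $(selector).text().trim();
  }
  return info;
}

function scrape(url, callback) {
  // Required here (npm install request cheerio) so that extractInfo
  // can also be reused on its own.
  const request = require('request');
  const cheerio = require('cheerio');

  // Default settings shared by every call made through `req`.
  const req = request.defaults({
    jar: true,                // pass cookies set by the page to later requests
    followAllRedirects: true  // assumption: follow redirects automatically
  });

  req.get({
    url: url,
    headers: { 'User-Agent': 'my-scraper/1.0' } // hypothetical custom header
  }, function (err, resp, body) {
    if (err) { return callback(err); }
    // Load the HTML into cheerio and select the elements we need.
    callback(null, extractInfo(cheerio.load(body)));
  });
}

// Usage (hypothetical URL):
// scrape('https://example.com/my-ip', function (err, info) {
//   if (err) { throw err; }
//   console.log(info);
// });
```

Keeping the selector logic in its own function means it can be adjusted for a different page without touching the request code.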

POST Requests

The above is an example of a GET request, but POST requests are possible too. You can make a POST request by calling req.post and passing in the form data using the form option.
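
A minimal sketch of such a POST request might look like this. The URL, the field names and the buildLoginForm and submitForm helpers are hypothetical; req.post and the form option are the pieces the text refers to. When form is set, request sends the fields URL-encoded, as a normal HTML form submission would.

```javascript
// Build the fields to submit. The field names are assumptions: use the
// `name` attributes of the inputs in the form you are targeting.
function buildLoginForm(username, password) {
  return { username: username, password: password };
}

function submitForm(url, form, callback) {
  const request = require('request'); // npm install request

  // Same defaults as for GET requests: keep cookies and follow redirects.
  const req = request.defaults({ jar: true, followAllRedirects: true });

  req.post({ url: url, form: form }, function (err, resp, body) {
    if (err) { return callback(err); }
    callback(null, resp.statusCode, body);
  });
}

// Usage (hypothetical URL and credentials):
// submitForm('https://example.com/login', buildLoginForm('me', 'secret'),
//   function (err, status, body) {
//     if (err) { throw err; }
//     console.log(status);
//   });
```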

JavaScript, jQuery, NodeJS, Scraping

2 comments on “NodeJS: Scraping Websites Using Request and Cheerio”

  1. Terence Watson says:
    December 14, 2016 at 8:07 AM

    When I do post requests, how do I deal with the security tokens and sessions and stuff. I can’t really just do a regular post request, it seems I need to pass in extra security related random values. Thanks in advance! 👍

    • Niraj Shah says:
      December 14, 2016 at 11:36 AM

      If you need to rely on a page-generated CSRF token or similar, you can use a GET request to fetch the initial page, use Cheerio to read the contents of the form (token etc.), and then pass that on to a subsequent POST request.

      Here is a very rough example:

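      var request = require('request');
      var cheerio = require('cheerio');

      request.get({
        url: "https://domain.com/page",
        jar: true,                 // store cookies in the global jar
        followAllRedirects: true
      }, function(err, resp, body){
        if (err) { return console.error(err); }

        // load the page and read the token from the form field
        var $ = cheerio.load(body);
        var token = $('[name="_token"]').val();

        request.post({
          url: "https://domain.com/post",
          form: { '_token': token },
          jar: true,               // reuse the cookies from the GET request
          followAllRedirects: true
        }, function(error, response, body){
          if (error) { return console.error(error); }

          // do something with the result

        }); // end of post

      }); // end of get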

© 2011-2025 Niraj Shah