Create a Web Page Scraper Using Node.js and Express

Creating a web scraper with Node.js and Express allows developers to extract data from websites efficiently. In this tutorial, we'll walk through building a web scraper that lets users retrieve specific data from web pages.

We'll cover the essential concepts: setting up a Node.js project, using a library like axios (or node-fetch) for HTTP requests, using a library like cheerio (or puppeteer) for parsing HTML and extracting data, and implementing routes with Express to handle scraping requests.

Read More: Creating CRUD APIs with Node.js and Sequelize CLI

Let’s get started.

What is a Web Page Scraper?

A web page scraper, often referred to as a web scraper or web scraping tool, is a program or script designed to extract data from websites. It automatically navigates through web pages, collects specific information, and then organizes and stores that data for various purposes.

Web scraping involves fetching a web page and extracting data from it by analyzing its HTML structure. This process can involve accessing URLs, parsing HTML content, and extracting specific data points or elements such as text, images, tables, links, or any other relevant information.
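
For example, the whole fetch-parse-extract cycle fits in a few lines. Here is a minimal sketch, with https://example.com standing in for whatever page you want to scrape:

const axios = require('axios');
const cheerio = require('cheerio');

// Fetch a page, load its HTML into cheerio, and extract one data point
axios.get('https://example.com')
    .then((response) => {
        const $ = cheerio.load(response.data); // parse the HTML
        console.log($('title').text().trim()); // extract the page title
    })
    .catch((error) => console.error(error.message));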

Here, we will build a web page scraper that extracts the page title, headings, links, script sources, images, meta tags, etc.

Steps To Create a Web Page Scraper with Node.js and Express

Create an application folder named webpage-scrap.

>> “package.json” setup

Open the project terminal and run this command:

npm init -y

It will create a package.json file with default values. Once the packages from the installation step below have been added, the file will look like this:

{
    "name": "scrapper",
    "version": "1.0.0",
    "description": "",
    "main": "index.js",
    "scripts": {
        "test": "echo \"Error: no test specified\" && exit 1"
    },
    "keywords": [],
    "author": "",
    "license": "ISC",
    "dependencies": {
        "axios": "^1.6.2",
        "cheerio": "^1.0.0-rc.12",
        "express": "^4.18.2",
        "nodemon": "^3.0.1"
    }
}

>> Main entry file “app.js” setup

Create an app.js file in your application folder. Once created, update the main entry in package.json:

"main": "app.js",

>> Installation of Node Packages

Open the project terminal and run this command to install the following Node packages:

npm install axios cheerio express nodemon

The above command installs four Node packages into your setup.

Usage of the Node packages:

  • The axios package is used to make HTTP requests and fetch the HTML content of a web page.
  • cheerio is a popular web scraping library for Node.js. It provides a jQuery-like interface for traversing and manipulating HTML and XML documents, making it easy to extract specific elements and data from web pages.
  • express is a popular Node.js web application framework.
  • nodemon watches for application changes and automatically restarts the web server (see the optional script setup after this list).
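
Optionally, you can wire nodemon into the npm scripts so the development server starts with a single command. Here is a small sketch of the "scripts" section in package.json (the script name dev is our own choice):

"scripts": {
    "dev": "nodemon app.js",
    "test": "echo \"Error: no test specified\" && exit 1"
},

With this in place, npm run dev behaves the same as npx nodemon.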

Read More: Create HTTP Web Server in Node js Using Express js

>> Code Setup to Scrape a Webpage by URL

Open app.js and add this complete code to it:

const express = require("express");
const axios = require('axios');
const cheerio = require('cheerio');

const PORT = 8087;

const app = express();

// Scrape website
app.get("/scrape", (req, res) => {

    // URL of the website you want to scrape
    const url = 'YOUR_WEBPAGE_URL';

    // Fetch the HTML content of the website
    axios.get(url)
        .then((response) => {
            const html = response.data;

            // Load the HTML into Cheerio
            const $ = cheerio.load(html);

            // Extract title
            const title = $('title').text().trim();

            // Extract meta tags
            const metaTags = [];
            $('meta').each((index, element) => {
                const tag = {};
                tag.name = $(element).attr('name') || $(element).attr('property') || $(element).attr('charset') || $(element).attr('http-equiv');
                tag.content = $(element).attr('content');
                metaTags.push(tag);
            });

            // Extract links
            const links = [];
            $('a').each((index, element) => {
                const href = $(element).attr('href');
                if (href) {
                    links.push(href);
                }
            });

            // Extract script sources
            const scripts = [];
            $('script').each((index, element) => {
                const src = $(element).attr('src');
                if (src) {
                    scripts.push(src);
                }
            });

            // Extract images
            const images = [];
            $('img').each((index, element) => {
                const src = $(element).attr('src');
                if (src) {
                    images.push(src);
                }
            });

            // Collect headings into arrays grouped by tag name (H1..H6)
            const headings = { H1: [], H2: [], H3: [], H4: [], H5: [], H6: [] };

            // Extract each heading and push it into its respective array
            $('h1, h2, h3, h4, h5, h6').each((index, element) => {
                const tagName = $(element).prop('tagName'); // e.g. "H1"
                headings[tagName].push($(element).text().trim());
            });

            // Display extracted data
            res.json({
                title: title,
                website: url,
                meta: metaTags,
                links: links,
                scripts: scripts,
                images: images,
                headings: headings
            });
        })
        .catch((error) => {
            // Note: serializing an Error object directly yields "{}",
            // so return its message (with a 500 status) instead
            res.status(500).json({
                website: url,
                error: error.message
            });
        });
});

app.listen(PORT, () => {
    console.log(`Application started on http://localhost:${PORT}`);
});
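
As an optional improvement beyond this tutorial, you could accept the target URL as a query parameter instead of hardcoding it. A minimal sketch of how the opening of the route would change:

// Optional variation: read the target URL from a query parameter,
// e.g. http://localhost:8087/scrape?url=https://example.com
app.get("/scrape", (req, res) => {

    const url = req.query.url;
    if (!url) {
        return res.status(400).json({ error: "Missing 'url' query parameter" });
    }

    // ...the axios/cheerio logic stays exactly the same as above
});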

Read More: Nodejs Express REST APIs with JWT Authentication Tutorial

All Done!

Application Testing

Open the project terminal and run this command:

npx nodemon

The above command will start the development server.

URL: http://localhost:8087/scrape
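
You can open this URL in a browser or call it from the terminal. The exact contents depend on the page you scrape, but the response shape follows the res.json() call above:

curl "http://localhost:8087/scrape"

{
    "title": "...",
    "website": "...",
    "meta": [ ... ],
    "links": [ ... ],
    "scripts": [ ... ],
    "images": [ ... ],
    "headings": { "H1": [ ... ], "H2": [ ... ], "H3": [ ... ], "H4": [ ... ], "H5": [ ... ], "H6": [ ... ] }
}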

That’s it.

We hope this article helped you learn how to create a web page scraper using Node.js and Express in a detailed way.

Online Web Tutor invites you to try Skillshike! Learn CakePHP, Laravel, CodeIgniter, Node.js, MySQL, Authentication, RESTful Web Services, etc. in depth. Master the coding skills to become an expert in PHP web development. So, search for your favourite course and enroll now.

If you liked this article, then please subscribe to our YouTube Channel for PHP and its frameworks, WordPress, and Node.js video tutorials. You can also find us on Twitter and Facebook.