How to make a simple crawler with Node.js (case tutorial)

Preparation

  1. First, download and install Node.js; this should be no problem.
  2. The original article asks you to download WebStorm. I already have it on my computer, but it is not required: everything can be done from the command line.

Create a project

Now that the preparations are done, let's start creating the project.

  1. First, create a folder wherever you want to put the resources. For example, I created a myStudyNodejs folder on the E drive.
  2. Enter the folder you created from the command line as shown in the picture. Switch to the E drive: E:
    Go into the folder: cd myStudyNodejs (the name of the folder you created)
    Note that all characters must be typed as English (half-width) symbols.
  3. Initialize the project: run npm init in the folder you created and press Enter through the prompts. Finally, type yes.
  4. After it runs, a package.json file will be generated in the folder, containing some basic information about the project (a sample is shown after this list).
  5. Install the required packages by running the following in the folder you created:
    npm install cheerio --save
    npm install request --save
    If you want to crawl Wuhan University, these two packages are enough. If you want to crawl Caoliu, you need an additional encoding-conversion package. On Windows it is:
    npm install iconv-lite --save
    On Mac, use: npm install iconv --save
    The result should look like the second picture (I mistyped a letter in the middle).
  6. Inside the folder you created, create a data folder to save the crawled text data, an image folder to store the image data, and a js file for the program, for example study.js. (Create a Notepad file and change .txt to .js.)
    The purpose of --save is to write the project's dependency on the package into the package.json file.
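For reference, after the installs above the dependencies section of package.json should look roughly like the sketch below. The name, main entry, and version numbers here are just illustrative and will differ on your machine:

{
  "name": "mystudynodejs",
  "version": "1.0.0",
  "description": "",
  "main": "study.js",
  "dependencies": {
    "cheerio": "^1.0.0-rc.2",
    "iconv-lite": "^0.4.0",
    "request": "^2.85.0"
  }
}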

Figure 1

Figure 2

Wuhan University School of Computer Science News Crawler Code

The following is the crawler code for the news of the School of Computer Science of Wuhan University. Copy it into the .js file you created and save it.

var http = require('http');
var fs = require('fs');
var cheerio = require('cheerio');
var request = require('request');
var i = 0;
//Initial url 
var url = "http://cs.whu.edu.cn/a/xinwendongtaifabu/2018/0428/7053.html"; 

//Encapsulates a layer of function
function fetchPage(x) {
  startRequest(x);
}

function startRequest(x) {
  //Use the http module to initiate a GET request to the server
  http.get(x, function (res) {
    var html = ''; //Used to store the entire HTML content of the requested web page
    var titles = [];
    res.setEncoding('utf-8'); //Prevent garbled Chinese characters
    //Listen to data events and take a piece of data at a time
    res.on('data', function (chunk) {
      html += chunk;
    });
    //Listen for the end event. Once the HTML of the entire page has been obtained, execute the callback
    res.on('end', function () {
      var $ = cheerio.load(html); //Use the cheerio module to parse the HTML
      var news_item = {
        //Get the title of the article
        title: $('div#container dt').text().trim(),
        i: i = i + 1,
      };

      console.log(news_item); //Print the news information
      var news_title = $('div#container dt').text().trim();
      savedContent($, news_title); //Store the content and title of each article
      savedImg($, news_title); //Store the images and title of each article
      //URL of the next article
      var nextLink = "http://cs.whu.edu.cn" + $("dd.Paging a").attr('href');
      str1 = nextLink.split('-'); //Remove the Chinese characters after the url
      str = encodeURI(str1[0]);
      //This is one of the highlights. By controlling i, you control how many articles to crawl. Wuhan University only has 8 articles, so it is set to 8
      if (i <= 8) {
        fetchPage(str);
      }
    });
  }).on('error', function (err) {
    console.log(err);
  });
}
//Store the crawled news content locally
function savedContent($, news_title) {
  $('dd.info').each(function (index, item) {
    var x = $(this).text();
    var y = x.substring(0, 2).trim();
    if (y == '') {
      x = x + '\n';
      //Append the news text to the /data folder piece by piece, naming the file after the news title
      fs.appendFile('./data/' + news_title + '.txt', x, 'utf-8', function (err) {
        if (err) {
          console.log(err);
        }
      });
    }
  })
}
//Store the crawled image resources locally
function savedImg($, news_title) {
  $('dd.info img').each(function (index, item) {
    var img_title = $(this).parent().next().text().trim(); //Get the title of the image
    if (img_title.length > 35 || img_title == "") {
      img_title = "Null";
    }
    var img_filename = img_title + '.jpg';
    var img_src = 'http://cs.whu.edu.cn' + $(this).attr('src'); //Get the URL of the image

    //Use the request module to request the image resource from the server
    request.head(img_src, function (err, res, body) {
      if (err) {
        console.log(err);
      }
    });
    //Stream the image into the local /image directory, named with the news title and the image title
    request(img_src).pipe(fs.createWriteStream('./image/' + news_title + '---' + img_filename));
  })
}

fetchPage(url); //The main program starts running

Now comes the exciting moment. In the current folder, run the js file you created. Mine, for example, is news.js.

node news.js

Figure 3

Text resources:

Figure 4

Image resources:

Figure 5

Caoliu Technology Forum Crawler

Crawling the Wuhan University news did not satisfy me, so I tried to crawl Caoliu's technical discussion forum (and of course some other things you can guess at). A few problems came up.
When crawling Caoliu, the HTTP request header must contain a User-Agent field, so the initial URL needs to be changed as follows:

var url = {
  hostname: 'cl.5fy.xyz',
  path: '/thread0806.php?fid=7',
  headers: {
    'Content-Type': 'text/html',
    //Without this field, you cannot access the site
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36',
  }
};
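Node's http.get accepts an options object like this in place of a plain URL string, so the rest of the code barely changes. A minimal sketch of a request with the headers attached:

//a minimal sketch: http.get sends the User-Agent header defined above
http.get(url, function (res) {
  console.log(res.statusCode); //should print 200 once the User-Agent is accepted
}).on('error', function (err) {
  console.log(err);
});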

Secondly, Node.js's built-in string encodings do not cover GBK, which this site uses, so you need to install an additional package to convert the encoding. The heart of the change is the decoding step: collect the response as a binary string, wrap it in a Buffer, and decode it from GBK.
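A minimal sketch of just that step, assuming iconv-lite is installed:

var iconv = require('iconv-lite');
//html is the response body accumulated after res.setEncoding('binary')
var buf = Buffer.from(html, 'binary'); //wrap the binary string in a Buffer
var str = iconv.decode(buf, 'GBK');    //decode the GBK bytes into a normal JS string

The full modified code is as follows: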

/*
* @Author: user
* @Date: 2018-04-28 19:34:50
* @Last Modified by: user
* @Last Modified time: 2018-04-30 21:35:26
*/
var http = require('http');
var fs = require('fs');
var cheerio = require('cheerio');
var request = require('request');
var iconv = require('iconv-lite');
var i = 0;
//Used to determine whether we are still collecting urls or already fetching pages
var temp = 0;
let startPage = 3; //Which page to start crawling from
let page = startPage;
let endPage = 5; //Which page to crawl up to
let searchText = ''; //Keyword to search for; the default empty string crawls everything, set it to suit your needs
//Initial url
var url = {
  hostname: '1024liuyouba.tk',
  path: '/thread0806.php?fid=16' + '&search=&page=' + startPage,
  headers: {
    'Content-Type': 'text/html',
    //Without this field, you cannot access the site
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36',
  }
};
//Store the thread urls collected from the index pages
urlList = [];
//Encapsulates a layer of function
function fetchPage(x) {
  setTimeout(function () {
    startRequest(x); //wait 5 seconds between requests so the server is not hammered
  }, 5000)
}
//First store the url of the interface to be accessed
function getUrl(x){
  temp++;
  http.get(x,function(res){
    var html = ''; 
    res.setEncoding('binary');
    res.on('data', function (chunk) {   
      html += chunk;
    });
    res.on('end', function () {
      var buf = Buffer.from(html, 'binary');
      var str = iconv.decode(buf, 'GBK');
      var $ = cheerio.load(str); //Use the cheerio module to parse the HTML
      $('tr.tr3 td.tal h3 a').each(function () {
        var search = $(this).text();
        if (search.indexOf(searchText) >= 0) {
          var nextLink = "http://cl.5fy.xyz/" + $(this).attr('href');
          str1 = nextLink.split('-'); //Remove the Chinese characters after the url
          str = encodeURI(str1[0]);
          urlList.push(str);
        }
      })
      page++;
      if (page < endPage) {
        //Store the next page URL
        x.path = '/thread0806.php?fid=16' + '&search=&page=' + page;
        getUrl(x);
      } else if (urlList.length != 0) {
        fetchPage(urlList.shift());
      } else {
        console.log('No keywords found!');
      }
    })
  }).on('error', function (err) {
    console.log(err);
  });

}
function startRequest(x) {
  if (temp === 0) {
    getUrl(x);
  } else {
    //Use the http module to initiate a GET request to the server
    http.get(x, function (res) {
      var html = ''; //Used to store the entire HTML content of the requested web page
      res.setEncoding('binary');
      var titles = [];
      //Listen to the data event and fetch a chunk of data at a time
      res.on('data', function (chunk) {
        html += chunk;
      });
      //Listen for the end event. Once the HTML of the entire page has been obtained, execute the callback
      res.on('end', function () {
        var buf = Buffer.from(html, 'binary');
        var str = iconv.decode(buf, 'GBK');
        var $ = cheerio.load(str); //Use the cheerio module to parse the HTML
        var news_item = {
          //Get the title of the article
          title: $('h4').text().trim(),
          //i is used to count how many articles have been fetched
          i: i = i + 1,
        };
        console.log(news_item); //Print the information
        var news_title = $('h4').text().trim();

        savedContent($, news_title); //Store the content and title of each article
        savedImg($, news_title); //Store the images and title of each article
        //If there are still urls left, keep visiting
        if (urlList.length != 0) {
          fetchPage(urlList.shift());
        }
      });
    }).on('error', function (err) {
      console.log(err);
    });
  }
}
//Store the crawled text content locally
function savedContent($, news_title) {
  $("div.t2[style].tpc_content.do_not_catch").each(function (index, item) {
    var x = $(this).text();
    x = x + '\n';
    //Append the text to the /data folder piece by piece, naming the file after the title
    fs.appendFile('./data/' + news_title + '.txt', x, 'utf-8', function (err) {
      if (err) {
        console.log(err);
      }
    });
  })
}
//Store the crawled image resources locally
function savedImg($, news_title) {
  //Create a folder for this thread's images
  fs.mkdir('./image/' + news_title, function (err) {
    if (err) { console.log(err) }
  });
  $('.tpc_content.do_not_catch input[src]').each(function (index, item) {
    var img_title = index; //Number each picture
    var img_filename = img_title + '.jpg';
    var img_src = $(this).attr('src'); //Get the URL of the image
    //Use the request module to request the image resource from the server
    request.head(img_src, function (err, res, body) {
      if (err) {
        console.log(err);
      }
    });
    setTimeout(function () {
      request({ uri: img_src, encoding: 'binary' }, function (error, response, body) {
        if (!error && response.statusCode == 200) {
          //Write the image into the local /image directory
          fs.writeFile('./image/' + news_title + '/' + img_filename, body, 'binary', function (err) {
            if (err) { console.log(err) }
          });
        }
      })
    });
  })
}
fetchPage(url); //The main program starts running

Results:

Figure 6

This is the end of this article on how to make a simple crawler with Node.js.

