A complete example of implementing a scheduled crawler with Node.js

Background

A few days ago, a friend asked me to help review the captain group of a Bilibili live room. Looking up members of the captain list one by one is naturally not a programmer's first choice; the right move is to hand the task over to the computer and let it do the work. Theory established, start coding.

The captain list is exposed through a known API, so the crawler simply uses Axios to hit that interface directly. A little time later, I had this crawler, which I called bilibili-live-captain-tools 1.0:

const axios = require('axios')

const roomid = "146088"
const ruid = "642922"
const url = `https://api.live.bilibili.com/xlive/app-room/v2/guardTab/topList?roomid=${roomid}&ruid=${ruid}&page_size=30`

// Map guard_level values to their titles
const Captin = {
  1: 'Governor',
  2: 'Admiral',
  3: 'Captain'
}

const reqPromise = url => axios.get(url);

let CaptinList = []
let UserList = []

// Fetch one page of the guard list; page 1 also carries the top3 entries
async function crawler(URL, pageNow) {
  const res = await reqPromise(URL);
  if (pageNow == 1) {
    CaptinList = CaptinList.concat(res.data.data.top3);
  }
  CaptinList = CaptinList.concat(res.data.data.list);
}

// Read the total page count from the response metadata
function getMaxPage(res) {
  const Info = res.data.data.info
  const { page: maxPage } = Info
  return maxPage
}

// Keep only the fields we care about for each guard entry
function getUserList(res) {
  for (let item of res) {
    const { uid, username, guard_level } = item
    UserList.push({ uid, username, Captin: Captin[guard_level] })
  }
}

async function main(UID) {
  const maxPage = await reqPromise(`${url}&page=1`).then(getMaxPage)
  for (let pageNow = 1; pageNow < maxPage + 1; pageNow++) {
    const URL = `${url}&page=${pageNow}`;
    await crawler(URL, pageNow);
  }
  getUserList(CaptinList)
  const result = search(UID, UserList)
  console.log(result)
  return result
}

// Linear scan for the requested uid; returns 0 when not found
function search(uid, UserList) {
  for (let i = 0; i < UserList.length; i++) {
    if (UserList[i].uid === uid) {
      return UserList[i];
    }
  }
  return 0
}

module.exports = {
  main
}

Obviously, this crawler can only be triggered manually, and it needs a command line and a Node environment to run. So I put a small page service in front of it with Koa2 and wrote an extremely simple page:

const path = require('path')
const fs = require('fs');
const Koa = require('koa');
const views = require('koa-views')
const router = require('koa-router')();
const index = require('./index')

const app = new Koa();

app.use(views(path.join(__dirname, './'), {
  extension: 'ejs'
}))
app.use(router.routes());

router.get('/', async ctx => {
  ctx.response.type = 'html';
  ctx.response.body = fs.createReadStream('./index.html');
})

router.get('/api/captin', async (ctx) => {
  const UID = ctx.request.query.uid
  console.log(UID)
  const Info = await index.main(parseInt(UID))
  await ctx.render('index', {
    Info,
  })
});

app.listen(3000);

Since the page has no throttling or debouncing, this version crawls in real time on every request: the wait is long, and frequent refreshes quickly trigger Bilibili's anti-crawler mechanism, so the server IP ended up under risk control.

So bilibili-live-captain-tools 2.0 was born

function throttle(fn, delay) {
  let timer;
  return function () {
    const _this = this;
    const args = arguments;
    if (timer) {
      return;
    }
    timer = setTimeout(function () {
      fn.apply(_this, args);
      // Clear the timer after fn runs, so the next trigger can schedule again.
      timer = null;
    }, delay);
  };
}
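The text mentions debouncing ("anti-shake") alongside throttling, but only the throttle is shown. For completeness, a minimal debounce counterpart might look like this (the `fn`/`delay` naming simply mirrors the throttle above):

```javascript
function debounce(fn, delay) {
  let timer;
  return function (...args) {
    // Each new call cancels the pending one; fn only runs
    // after `delay` ms with no further calls.
    clearTimeout(timer);
    timer = setTimeout(() => fn.apply(this, args), delay);
  };
}
```

Whereas the throttle guarantees at most one call per delay window, debounce waits until the calls stop entirely before firing once.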

Add throttling and debouncing, and switch to a pseudo real-time crawler: a scheduled task crawls once a minute, and page requests are served from the cached result.
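The "crawl once a minute, serve from cache" idea decouples the page from the crawler. A minimal sketch of that pattern, where `fetchCaptains` is a hypothetical stand-in for the crawler's `main`:

```javascript
// In-memory cache refreshed by the scheduled job; requests only read it.
let cache = { data: null, updatedAt: 0 };

// The scheduled job calls this; fetchCaptains stands in for the crawler.
async function refresh(fetchCaptains) {
  cache = { data: await fetchCaptains(), updatedAt: Date.now() };
}

// HTTP handlers call this instead of crawling on every request.
function getCached() {
  return cache;
}
```

Even if a refresh fails, the handlers keep serving the last good snapshot, so a temporary ban no longer breaks the page.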

This means we need to execute the crawler script on a schedule. My first thought was the schedule plugin of egg.js, but I didn't want something that heavyweight for a simple crawler. When in doubt, search. So we arrived at the following plan:

Use Node Schedule to implement scheduled tasks

Node Schedule is a flexible cron and non-cron job scheduler for Node.js. It allows you to schedule a job (an arbitrary function) to be executed on specific dates, with optional recurrence rules. It only uses one timer at any given time (instead of re-evaluating upcoming jobs every second/minute).

1. Install node-schedule

npm install node-schedule
# or yarn add node-schedule

2. Basic Usage

Let’s take a look at the official examples.

const schedule = require('node-schedule');

const job = schedule.scheduleJob('42 * * * *', function(){
  console.log('The answer to life, the universe, and everything!');
});

The first parameter of schedule.scheduleJob is a cron-style string that follows the rules below.

Node Schedule rules are shown in the following table

*    *    *    *    *    *
┬    ┬    ┬    ┬    ┬    ┬
│    │    │    │    │    └ day of week (0 - 7, where 0 and 7 both mean Sunday)
│    │    │    │    └───── month (1 - 12)
│    │    │    └────────── day of month (1 - 31)
│    │    └─────────────── hour (0 - 23)
│    └──────────────────── minute (0 - 59)
└────────────────────────── second (0 - 59, optional)
Instead of a cron string, you can also pass a specific time as a Date object.

Understand the rules and implement one yourself

const schedule = require('node-schedule');

// Define a time (note: the month argument is zero-based, so 2 = March)
let date = new Date(2021, 2, 10, 12, 0, 0);

// Define a task
let job = schedule.scheduleJob(date, () => {
  console.log("Current time:", new Date());
});

The above example fires once, at 12:00 on March 10, 2021.

3. Advanced Usage

In addition to the basic usage, we can also use some more flexible methods to implement scheduled tasks.

3.1. Execute once every minute

const schedule = require('node-schedule');

// Define a rule: fire whenever the second is 0, i.e. once a minute
let rule = new schedule.RecurrenceRule();
rule.second = 0;

// Start the task
let job = schedule.scheduleJob(rule, () => {
  console.log(new Date());
});

The rule supports the following values: second, minute, hour, date, dayOfWeek, month, year, etc.

Some common rules:

// Execute every second
rule.second = [0, 1, 2, 3, ..., 59];

// Execute every minute, at second 0
rule.second = 0;

// Execute every hour, at minute 30 (hh:30:00)
rule.minute = 30;
rule.second = 0;

// Execute every day at 00:00:00
rule.hour = 0;
rule.minute = 0;
rule.second = 0;

// Execute at 10:00 on the 1st of every month
rule.date = 1;
rule.hour = 10;
rule.minute = 0;
rule.second = 0;

// Execute every Monday, Wednesday and Friday at 0:00 and 12:00
rule.dayOfWeek = [1, 3, 5];
rule.hour = [0, 12];
rule.minute = 0;
rule.second = 0;
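For intuition, the `rule.second = 0` case ("once a minute, on the minute") can be reproduced without any library by computing the delay to the next minute boundary and chaining setTimeout. This is just a dependency-free sketch, not how node-schedule is actually implemented:

```javascript
// Milliseconds until the next minute boundary (second === 0),
// i.e. the moment a `rule.second = 0` job would fire.
function msUntilNextMinute(now = new Date()) {
  return 60000 - (now.getSeconds() * 1000 + now.getMilliseconds());
}

// Chain setTimeout so each run re-aligns to the boundary.
function everyMinute(task) {
  setTimeout(() => {
    task();
    everyMinute(task);
  }, msUntilNextMinute());
}
```

Re-computing the delay on every run keeps the schedule aligned even if a task runs long; node-schedule's single-timer design follows the same principle.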

4. Termination of the task

You can use cancel() to terminate a scheduled job. When a task runs into an abnormal state, cancel it promptly:

job.cancel();
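One way to apply the "cancel on abnormality" advice is to wrap the task body so any thrown error cancels the job. The sketch below takes a generic cancel callback rather than node-schedule's Job object, so it can be tested in isolation; in practice you would pass `() => job.cancel()`:

```javascript
// Wrap a task so that an exception cancels further scheduling.
function withAutoCancel(task, cancel) {
  return (...args) => {
    try {
      return task(...args);
    } catch (err) {
      console.error('task failed, cancelling job:', err.message);
      cancel(); // e.g. () => job.cancel() for a node-schedule Job
    }
  };
}
```

Without a wrapper like this, an uncaught exception inside the callback would surface on every tick instead of stopping the job.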

Summary

node-schedule is a crontab-style scheduling module for Node.js. Scheduled tasks can maintain the server system, performing necessary operations at fixed intervals, and can also be used to send emails, crawl data, and so on.

This is the end of this article about implementing scheduled crawlers with Node.js, originally published on 123WORDPRESS.COM.

