A complete example of implementing a scheduled crawler with Node.js

Cause of the incident
Use Node Schedule to implement scheduled tasks

1. Install node-schedule
2. Basic Usage
3. Advanced Usage
4. Termination of the task

Summary

Cause of the incident

A few days ago, I had to help a friend check the captain list of a Bilibili live room. Looking through the list entry by entry is hardly a programmer's first choice. The right move is to hand the task to the computer and let it do the work. With the plan settled, I started coding.

Since the API for the captain list is already known, the crawler simply uses Axios to request the interface directly.

So I spent a little time writing this crawler, which I called bilibili-live-captain-tools 1.0.

const axios = require('axios')

// Query parameters for the target live room
const roomid = "146088"
const ruid = "642922"
const url = `https://api.live.bilibili.com/xlive/app-room/v2/guardTab/topList?roomid=${roomid}&ruid=${ruid}&page_size=30`

// Maps the guard_level returned by the API to a rank name
const Captin = {
  1: 'Governor',
  2: 'Admiral',
  3: 'Captain'
}

const reqPromise = url => axios.get(url);

let CaptinList = []
let UserList = []

// Fetch one page and append its entries to CaptinList
async function crawler(URL, pageNow) {
  const res = await reqPromise(URL);
  // top3 is added only once, on the first page
  if (pageNow === 1) {
    CaptinList = CaptinList.concat(res.data.data.top3);
  }
  CaptinList = CaptinList.concat(res.data.data.list);
}

// Read the total number of pages from a response
function getMaxPage(res) {
  const Info = res.data.data.info
  const { page: maxPage } = Info
  return maxPage
}

// Reduce each raw entry to uid, username, and rank name
function getUserList(res) {
  for (let item of res) {
    const { uid, username, guard_level } = item
    UserList.push({ uid, username, Captin: Captin[guard_level] })
  }
}

async function main(UID) {
  // Start from empty lists so repeated calls do not accumulate duplicates
  CaptinList = []
  UserList = []
  const maxPage = await reqPromise(`${url}&page=1`).then(getMaxPage)
  for (let pageNow = 1; pageNow <= maxPage; pageNow++) {
    const URL = `${url}&page=${pageNow}`;
    await crawler(URL, pageNow);
  }
  getUserList(CaptinList)
  const result = search(UID, UserList)
  console.log(result)
  return result
}

// Return the user with the given uid, or 0 if there is no match
function search(uid, UserList) {
  for (let i = 0; i < UserList.length; i++) {
    if (UserList[i].uid === uid) {
      return UserList[i];
    }
  }
  return 0
}

module.exports = {
  main
}

This crawler can only be triggered manually, and it needs a command line and a Node environment to run. So I used Koa2 to put a page service in front of it and wrote an extremely simple page.

const Koa = require('koa');
const app = new Koa();
const path = require('path')
const fs = require('fs');
const router = require('koa-router')();
const index = require('./index')
const views = require('koa-views')

// Render .ejs templates from the current directory
app.use(views(path.join(__dirname, './'), {
  extension: 'ejs'
}))
app.use(router.routes());

// Serve the static search page
router.get('/', async ctx => {
  ctx.response.type = 'html';
  ctx.response.body = fs.createReadStream('./index.html');
})

// Look up a user by uid and render the result
router.get('/api/captin', async (ctx) => {
  const UID = ctx.request.query.uid
  console.log(UID)
  const Info = await index.main(parseInt(UID, 10))
  await ctx.render('index', {
    Info,
  })
});

app.listen(3000);

Because the page has neither throttling nor debouncing, and this version crawls in real time on every request, responses are slow and frequent refreshes trigger Bilibili's anti-crawler mechanism. As a result, the server's IP was placed under risk control.

So bilibili-live-captain-tools 2.0 was born.

function throttle(fn, delay) {
  var timer;
  return function () {
    var _this = this;
    var args = arguments;
    // A run is already pending, so ignore this call
    if (timer) {
      return;
    }
    timer = setTimeout(function () {
      fn.apply(_this, args);
      // Clear the timer once fn has run after the delay, so the next call can schedule a new run
      timer = null;
    }, delay)
  }
}

Version 2.0 adds throttling and debouncing, and switches to a pseudo-real-time crawler that runs once a minute as a scheduled task.
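The code above covers throttling only. Debouncing, the other technique mentioned, could look like the following minimal sketch. It is not taken from the original tool; the lookup function and the 50 ms delay are placeholders chosen for illustration.

```javascript
// Debounce: run fn only after calls have stopped for `delay` milliseconds.
function debounce(fn, delay) {
  let timer = null;
  return function (...args) {
    // Each new call cancels the pending run and restarts the countdown.
    clearTimeout(timer);
    timer = setTimeout(() => {
      timer = null;
      fn.apply(this, args);
    }, delay);
  };
}

// Hypothetical usage: collapse a burst of lookups into a single one.
let lookups = 0;
const lookup = debounce((uid) => {
  lookups += 1;
  console.log('looking up', uid);
}, 50);

lookup(1);
lookup(2);
lookup(3); // only this call runs, once the burst has been quiet for 50 ms
```

Throttling limits how often a function can run during a burst, while debouncing waits for the burst to end and then runs once.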

That meant running the crawler script on a schedule. My first thought was the schedule feature of Egg, but I did not want to make a small crawler so heavyweight. When in doubt, search Baidu, which led to the following plan.

Use Node Schedule to implement scheduled tasks

Node Schedule is a flexible cron and non-cron job scheduler for Node.js. It allows you to schedule a job (an arbitrary function) to be executed on specific dates, with optional recurrence rules. It only uses one timer at any given time (instead of re-evaluating upcoming jobs every second/minute).

1. Install node-schedule

npm install node-schedule
# or yarn add node-schedule

2. Basic Usage

Let’s take a look at the official examples.

const schedule = require('node-schedule');

const job = schedule.scheduleJob('42 * * * *', function(){
 console.log('The answer to life, the universe, and everything!');
});

The first argument of schedule.scheduleJob is a cron-style string with the following fields:

*    *    *    *    *    *
┬    ┬    ┬    ┬    ┬    ┬
│    │    │    │    │    │
│    │    │    │    │    └ day of week, value range: 0 - 7, where 0 and 7 both represent Sunday
│    │    │    │    └───── month, value range: 1 - 12
│    │    │    └────────── day of month, value range: 1 - 31
│    │    └─────────────── hour, value range: 0 - 23
│    └──────────────────── minute, value range: 0 - 59
└───────────────────────── second, value range: 0 - 59 (optional)

Instead of a cron string, you can also pass a specific time as a Date object, such as: const date = new Date()
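To make the field order concrete, here is a small sketch that labels each field of a cron-style string. labelCronFields is a hypothetical helper written for this article and is not part of node-schedule. It only splits the string and does not validate value ranges.

```javascript
// Label the fields of a cron-style string in the order described above.
// labelCronFields is a hypothetical helper, not part of node-schedule.
function labelCronFields(expression) {
  const names = ['second', 'minute', 'hour', 'dayOfMonth', 'month', 'dayOfWeek'];
  const fields = expression.trim().split(/\s+/);
  if (fields.length !== 5 && fields.length !== 6) {
    throw new Error(`expected 5 or 6 fields, got ${fields.length}`);
  }
  // With only five fields, the optional seconds field is absent.
  const used = fields.length === 6 ? names : names.slice(1);
  return Object.fromEntries(used.map((name, i) => [name, fields[i]]));
}

// Five fields: minute is '42', everything else is '*' (minute 42 of every hour).
console.log(labelCronFields('42 * * * *'));

// Six fields: the first one is the second (09:30:00 on Mondays).
console.log(labelCronFields('0 30 9 * * 1'));
```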

With the rules understood, let's write one ourselves.

const schedule = require('node-schedule');

// Define a time
let date = new Date(2021, 3, 10, 12, 0, 0);

// Define a task
let job = schedule.scheduleJob(date, () => {
  console.log("Current time:", new Date());
});

The example above prints the current time once, at 12:00 on April 10, 2021 (months in the Date constructor are zero-based, so 3 means April).

3. Advanced Usage

In addition to the basic usage, we can also use some more flexible methods to implement scheduled tasks.

3.1. Execute once every minute

const schedule = require('node-schedule');

// Define the rule
let rule = new schedule.RecurrenceRule();
// Run once every minute, at second 0
rule.second = 0

// Start the task
let job = schedule.scheduleJob(rule, () => {
  console.log(new Date());
});

The rule supports the following values: second, minute, hour, date, dayOfWeek, month, year, etc.

Some common rules are shown in the following table:

Execute every second:
rule.second = new schedule.Range(0, 59);

Execute every minute, at second 0:
rule.second = 0;

Execute every 30 minutes (at minutes 0 and 30 of every hour):
rule.minute = [0, 30];
rule.second = 0;

Execute at 0:00 every day:
rule.hour = 0;
rule.minute = 0;
rule.second = 0;

Execute at 10:00 on the 1st of every month:
rule.date = 1;
rule.hour = 10;
rule.minute = 0;
rule.second = 0;

Execute every Monday, Wednesday, and Friday at 0:00 and 12:00:
rule.dayOfWeek = [1,3,5];
rule.hour = [0,12];
rule.minute = 0;
rule.second = 0;
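A single value such as rule.minute = 30 matches only minute 30 of each hour, so it fires once an hour. To run every N minutes, the rule needs the list of matching minutes. The sketch below builds that list; everyNMinutes is a hypothetical helper, not a node-schedule function, and the wiring at the bottom is left as comments because it assumes node-schedule is installed.

```javascript
// Hypothetical helper: list the minutes of the hour that are `step` apart, starting at 0.
function everyNMinutes(step) {
  if (!Number.isInteger(step) || step < 1 || step > 59) {
    throw new RangeError('step must be an integer between 1 and 59');
  }
  const minutes = [];
  for (let m = 0; m < 60; m += step) {
    minutes.push(m);
  }
  return minutes;
}

console.log(everyNMinutes(30)); // [ 0, 30 ]
console.log(everyNMinutes(15)); // [ 0, 15, 30, 45 ]

// Wiring it into a rule (assumes node-schedule is installed):
// const schedule = require('node-schedule');
// const rule = new schedule.RecurrenceRule();
// rule.minute = everyNMinutes(30);
// rule.second = 0;
// schedule.scheduleJob(rule, () => console.log(new Date()));
```

If the step does not divide 60 evenly, the gap between the last run of one hour and the first run of the next is shorter than the step.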

4. Termination of the task

You can call cancel() on a job to stop it; its pending invocations will no longer run. When a task starts failing, cancel it promptly.

job.cancel();
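One way to cancel a misbehaving task promptly is to count consecutive failures and call cancel() once a limit is reached. The wrapper below is a sketch under assumptions: scheduleWithFailureLimit and the limit of 3 are hypothetical, and the scheduling function is passed in as a parameter, so with node-schedule you would pass (spec, fn) => schedule.scheduleJob(spec, fn).

```javascript
// Hypothetical wrapper: cancel the job after `maxFailures` consecutive errors.
// `scheduleJob` is injected; with node-schedule, pass
// (spec, fn) => schedule.scheduleJob(spec, fn).
function scheduleWithFailureLimit(scheduleJob, spec, task, maxFailures = 3) {
  let failures = 0;
  const job = scheduleJob(spec, async () => {
    try {
      await task();
      failures = 0; // a success resets the counter
    } catch (err) {
      failures += 1;
      console.error(`task failed (${failures}/${maxFailures}):`, err.message);
      if (failures >= maxFailures) {
        job.cancel(); // stop all future runs
      }
    }
  });
  return job;
}

// Example wiring (assumes node-schedule is installed and `crawl` is your task):
// const schedule = require('node-schedule');
// scheduleWithFailureLimit((spec, fn) => schedule.scheduleJob(spec, fn), '* * * * *', crawl, 3);
```

Resetting the counter on success means that only an unbroken run of errors stops the job, so a single transient network failure does not end the schedule.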

Summary

node-schedule is a cron-like job scheduler for Node.js. Scheduled tasks can be used to maintain a server by performing necessary operations at fixed times, and also to send emails, crawl data, and so on.
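Tying this back to the crawler, the pseudo-real-time idea can be sketched as a cache that a scheduled job refreshes and that page requests read from. This is an illustration rather than the code of bilibili-live-captain-tools 2.0: createCachedCrawler and fetchCaptainList are hypothetical names, and the wiring is left as comments because it assumes node-schedule is installed.

```javascript
// Hypothetical cache that a scheduled job refreshes and request handlers read.
function createCachedCrawler(crawl) {
  let snapshot = { updatedAt: null, data: [] };
  let running = false;

  async function refresh() {
    // Skip this run if the previous crawl is still in progress.
    if (running) return false;
    running = true;
    try {
      const data = await crawl();
      snapshot = { updatedAt: new Date(), data };
      return true;
    } finally {
      running = false;
    }
  }

  return { refresh, get: () => snapshot };
}

// Wiring (assumes node-schedule is installed; fetchCaptainList stands in for the crawler):
// const schedule = require('node-schedule');
// const cached = createCachedCrawler(fetchCaptainList);
// schedule.scheduleJob('* * * * *', () => cached.refresh().catch(console.error));
// A request handler can then answer from cached.get().data without calling the API.
```

With this arrangement the API is requested at most once a minute no matter how often the page is refreshed, which is the property needed to stay clear of the anti-crawler mechanism.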
