A complete example of implementing a timed crawler with Nodejs

Background

A few days ago, I had to help a friend audit the captain group of a Bilibili live room. Looking up entries in the captain list one by one is hardly a programmer's first choice; the right move is to hand the job to the computer and let it do the work. With the theory settled, it was time to start coding.

The captain list API is already known, so the crawler can simply call the interface directly with Axios.

I spent a little time writing this crawler and called it bilibili-live-captain-tools 1.0:

const axios = require('axios')
const roomid = "146088"
const ruid = "642922"
const url = `https://api.live.bilibili.com/xlive/app-room/v2/guardTab/topList?roomid=${roomid}&ruid=${ruid}&page_size=30`

const Captin = {
 1: 'Governor',
 2: 'Admiral',
 3: 'Captain'
}

const reqPromise = url => axios.get(url);

let CaptinList = []
let UserList = []

// Fetch one page and append its entries to CaptinList
async function crawler(URL, pageNow) {
  const res = await reqPromise(URL);
  // top3 is merged only once, from the first page
  if (pageNow === 1) {
    CaptinList = CaptinList.concat(res.data.data.top3);
  }
  CaptinList = CaptinList.concat(res.data.data.list);
}

// Read the total page count from a response
function getMaxPage(res) {
  const { page: maxPage } = res.data.data.info
  return maxPage
}

// Convert raw entries into { uid, username, Captin } records
function getUserList(res) {
  for (const item of res) {
    const { uid, username, guard_level } = item
    UserList.push({ uid, username, Captin: Captin[guard_level] })
  }
}

async function main(UID) {
  // Reset module-level state so repeated calls do not accumulate duplicates
  CaptinList = []
  UserList = []
  const maxPage = await reqPromise(`${url}&page=1`).then(getMaxPage)
  for (let pageNow = 1; pageNow <= maxPage; pageNow++) {
    const URL = `${url}&page=${pageNow}`;
    await crawler(URL, pageNow);
  }
  getUserList(CaptinList)
  const result = search(UID, UserList)
  console.log(result)
  return result
}

// Linear search by uid; returns 0 when no entry matches
function search(uid, UserList) {
  for (let i = 0; i < UserList.length; i++) {
    if (UserList[i].uid === uid) {
      return UserList[i];
    }
  }
  return 0
}

module.exports = {
 main
}
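
The paging logic above can also be written without module-level state. The following is only a sketch under stated assumptions: `collectCaptains`, `indexByUid` and the injected `fetchPage` function are hypothetical names, and `fetchPage(page)` is assumed to resolve to the same shape as `res.data.data` in the crawler (an `info.page` count, a `top3` array and a `list` array).

```javascript
// Hypothetical helper: fetchPage(page) stands in for the axios call and is
// assumed to resolve to an object shaped like res.data.data above.
async function collectCaptains(fetchPage) {
  const first = await fetchPage(1);
  const maxPage = first.info.page;
  // top3 is merged only once, from page 1, as in the original crawler
  let captains = [...first.top3, ...first.list];
  for (let page = 2; page <= maxPage; page++) {
    const data = await fetchPage(page);
    captains = captains.concat(data.list);
  }
  return captains;
}

// Build a uid -> record lookup so a search does not need a linear scan
function indexByUid(captains, levelNames) {
  const byUid = new Map();
  for (const { uid, username, guard_level } of captains) {
    byUid.set(uid, { uid, username, Captin: levelNames[guard_level] });
  }
  return byUid;
}
```

Because the fetch function is passed in, page 1 is requested only once and the logic can be exercised with fake data instead of the live API.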

Obviously, this crawler can only be triggered manually, and it needs a command line and a Node environment to run. So I set up a page service for it with Koa2 and wrote an extremely simple page:

const Koa = require('koa');
const app = new Koa();
const path = require('path')
const fs = require('fs');
const router = require('koa-router')();
const index = require('./index')
const views = require('koa-views')



app.use(views(path.join(__dirname, './'), {
  extension: 'ejs'
}))
app.use(router.routes());

// Serve the static query page
router.get('/', async ctx => {
  ctx.response.type = 'html';
  ctx.response.body = fs.createReadStream('./index.html');
})

// Look up one uid in the captain list and render the result
router.get('/api/captin', async (ctx) => {
  const UID = ctx.request.query.uid
  console.log(UID)
  const Info = await index.main(parseInt(UID, 10))
  await ctx.render('index', {
    Info,
  })
});

app.listen(3000);

The page has no throttling or debouncing, so this version crawls in real time on every request. The waiting time is long, and frequent refreshes naturally trigger Bilibili's anti-crawler mechanism, which puts the server's IP under risk control.

So bilibili-live-captain-tools 2.0 was born

function throttle(fn, delay) {
  var timer;
  return function () {
    var _this = this;
    var args = arguments;
    // A call is already pending, so ignore this one
    if (timer) {
      return;
    }
    timer = setTimeout(function () {
      fn.apply(_this, args);
      // Clear the timer once fn has run after the delay, so that the next
      // throttled call can schedule a new timer
      timer = null;
    }, delay)
  }
}
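
The article mentions debouncing ("anti-shake") next to throttling but only lists the throttle helper. For comparison, here is a minimal debounce sketch; it is an illustration, not the code used in bilibili-live-captain-tools. Where throttle ignores calls while a timer is pending, debounce restarts the timer on every call, so `fn` runs only after the calls have stopped for `delay` milliseconds.

```javascript
// Debounce: postpone fn until `delay` ms have passed without a new call.
function debounce(fn, delay) {
  let timer = null;
  return function (...args) {
    // Every new call cancels the pending one and starts the wait again
    clearTimeout(timer);
    timer = setTimeout(() => {
      timer = null;
      fn.apply(this, args);
    }, delay);
  };
}
```

With a query button, for example, three quick clicks would result in a single request carrying the arguments of the last click.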

Version 2.0 adds throttling and debouncing, and switches to a pseudo-real-time crawler: a scheduled task crawls once a minute instead of crawling on every request.
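
One way to picture the pseudo-real-time idea is a small in-memory cache that the scheduled task refreshes and the page handler reads. This is a sketch under assumptions, not the project's actual code: `createCaptainCache` and its `crawl` parameter are hypothetical, and `crawl` is assumed to be an async function that returns the list of `{ uid, username, Captin }` records.

```javascript
// Hypothetical cache: requests read the last snapshot, only the scheduled
// task talks to the remote API.
function createCaptainCache(crawl) {
  let users = [];
  let updatedAt = null;
  let refreshing = false;

  async function refresh() {
    if (refreshing) return; // skip if the previous crawl is still running
    refreshing = true;
    try {
      users = await crawl();
      updatedAt = new Date();
    } catch (err) {
      // On failure keep serving the previous snapshot
      console.error('crawl failed:', err.message);
    } finally {
      refreshing = false;
    }
  }

  // Returns the matching record, or null when the uid is not in the snapshot
  function find(uid) {
    return users.find((u) => u.uid === uid) || null;
  }

  return { refresh, find, lastUpdated: () => updatedAt };
}
```

The scheduled job would then call `cache.refresh()` once a minute, while the route handler only calls `cache.find(uid)`, so page refreshes no longer hit Bilibili directly.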

That means the crawler script has to run on a schedule. My first thought was the schedule feature of egg, but I did not want to make a small crawler that heavyweight. When in doubt, search Baidu, which led to the following plan.

Use Node Schedule to implement scheduled tasks

Node Schedule is a flexible cron and non-cron job scheduler for Node.js. It allows you to schedule a job (an arbitrary function) to be executed on specific dates, with optional recurrence rules. It only uses one timer at any given time (instead of re-evaluating upcoming jobs every second/minute).

1. Install node-schedule

npm install node-schedule
# or yarn add node-schedule

2. Basic Usage

Let’s take a look at the official examples.

const schedule = require('node-schedule');

const job = schedule.scheduleJob('42 * * * *', function(){
 console.log('The answer to life, the universe, and everything!');
});

The first argument of schedule.scheduleJob is a cron-style string whose fields are laid out as follows:

*    *    *    *    *    *
┬    ┬    ┬    ┬    ┬    ┬
│    │    │    │    │    │
│    │    │    │    │    └ day of the week, value: 0 - 7 (0 and 7 both mean Sunday)
│    │    │    │    └───── month, value: 1 - 12
│    │    │    └────────── day of the month, value: 1 - 31
│    │    └─────────────── hour, value: 0 - 23
│    └──────────────────── minute, value: 0 - 59
└───────────────────────── second, value: 0 - 59 (optional)

You can also pass a specific point in time instead, such as: const date = new Date()
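
To make the field order of the cron-style string concrete, the sketch below splits an expression and labels each field. `labelCronFields` is a hypothetical helper written for this article, not part of node-schedule, and it does not validate the values themselves.

```javascript
const FIELD_NAMES = ['second', 'minute', 'hour', 'dayOfMonth', 'month', 'dayOfWeek'];

// Label the fields of a 5- or 6-field cron expression.
function labelCronFields(expression) {
  const parts = expression.trim().split(/\s+/);
  if (parts.length !== 5 && parts.length !== 6) {
    throw new Error(`expected 5 or 6 fields, got ${parts.length}`);
  }
  // With 5 fields the optional leading seconds field has been omitted
  const names = parts.length === 6 ? FIELD_NAMES : FIELD_NAMES.slice(1);
  const labelled = {};
  names.forEach((name, i) => {
    labelled[name] = parts[i];
  });
  return labelled;
}

// The official example: minute is '42', every other field is '*'
console.log(labelCronFields('42 * * * *'));
```

Read this way, '42 * * * *' fires at minute 42 of every hour, because only the minute field is pinned.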

Understand the rules and implement one yourself

const schedule = require('node-schedule');

// Define a time (months are zero-based in JavaScript, so 2 means March)
let date = new Date(2021, 2, 10, 12, 0, 0);

// Define a task
let job = schedule.scheduleJob(date, () => {
  console.log("Current time:", new Date());
});

The above example reports the current time once, at 12:00 on March 10, 2021.

3. Advanced Usage

In addition to the basic usage, we can also use some more flexible methods to implement scheduled tasks.

3.1. Execute once every minute

const schedule = require('node-schedule');

// Define the rule
let rule = new schedule.RecurrenceRule();
// Execute once every minute, at second 0
rule.second = 0

// Start the task
let job = schedule.scheduleJob(rule, () => {
  console.log(new Date());
});

The rule supports the following values: second, minute, hour, date, dayOfWeek, month, year, etc.

Some common rules:

// Execute every second
rule.second = [new schedule.Range(0, 59)]; // every value from 0 to 59

// Execute every minute, at second 0
rule.second = 0;

// Execute once an hour, at minute 30 (use rule.minute = [0, 30] to run every 30 minutes)
rule.minute = 30;
rule.second = 0;

// Execute at 0:00 every day
rule.hour = 0;
rule.minute = 0;
rule.second = 0;

// Execute at 10:00 on the 1st of every month
rule.date = 1;
rule.hour = 10;
rule.minute = 0;
rule.second = 0;

// Execute every Monday, Wednesday, and Friday at 0:00 and 12:00
rule.dayOfWeek = [1, 3, 5];
rule.hour = [0, 12];
rule.minute = 0;
rule.second = 0;
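
A rule fires when every field that has been set matches the current time, which is why rule.minute = 30 means "at minute 30" rather than "every 30 minutes". The sketch below is a simplified model of that matching, not node-schedule's implementation: `fieldMatches` and `ruleMatches` are hypothetical names, rules are plain objects, and an unset field is assumed to match anything.

```javascript
// Simplified model: an unset field matches anything, a number must match
// exactly, and an array matches any of its members.
function fieldMatches(spec, value) {
  if (spec === null || spec === undefined) return true;
  if (Array.isArray(spec)) return spec.includes(value);
  return spec === value;
}

// True when every field set on the rule matches the given date (local time)
function ruleMatches(rule, date) {
  return (
    fieldMatches(rule.second, date.getSeconds()) &&
    fieldMatches(rule.minute, date.getMinutes()) &&
    fieldMatches(rule.hour, date.getHours()) &&
    fieldMatches(rule.date, date.getDate()) &&
    fieldMatches(rule.dayOfWeek, date.getDay())
  );
}

const halfPast = { minute: 30, second: 0 };
console.log(ruleMatches(halfPast, new Date(2021, 2, 10, 9, 30, 0))); // true
console.log(ruleMatches(halfPast, new Date(2021, 2, 10, 9, 0, 0)));  // false
```

The second call is false because 9:00 has minute 0, so a rule with minute 30 fires only once per hour.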

4. Termination of the task

You can use cancel() to stop a scheduled task. When a task starts behaving abnormally, cancel it promptly:

job.cancel();
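
What counts as "abnormal" is up to the application. One possible policy, sketched here as an assumption rather than anything the library prescribes, is to give up after several consecutive failures. `guardTask` is a hypothetical wrapper, and `onGiveUp` is where the real code would call job.cancel().

```javascript
// Wrap a task so that `onGiveUp` is called once it has failed
// `maxFailures` times in a row.
function guardTask(task, maxFailures, onGiveUp) {
  let failures = 0;
  return async function guarded() {
    try {
      await task();
      failures = 0; // a success resets the streak
    } catch (err) {
      failures += 1;
      console.error(`task failed (${failures}/${maxFailures}):`, err.message);
      if (failures >= maxFailures) onGiveUp();
    }
  };
}

// Intended wiring with node-schedule (not executed here; `crawl` is assumed):
//   const job = schedule.scheduleJob('* * * * *',
//     guardTask(crawl, 3, () => job.cancel()));
```

Catching the error inside the wrapper also keeps a rejected crawl from surfacing as an unhandled promise rejection.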

Summary

node-schedule is a cron-like scheduling module for Node.js. Scheduled tasks can be used to maintain a server system by running necessary operations at fixed times, and also to send emails, crawl data, and so on.
