A brief discussion on the optimization of MySQL paging for billions of data


Background

After getting off work, I was sitting happily on the subway home, thinking about how to plan my weekend.

Suddenly my phone rang. It was one of our developers, and I immediately tensed up: this week's release had already gone out, and a call at this hour usually means a production problem.

Sure enough, a production data query interface was being hammered with wild, unreasonable calls, and this dragged down the online MySQL cluster.
This was serious. I got off the subway, rushed home, turned on my computer, and pulled up the slow query log in Pinpoint with my colleagues. There I saw a very strange query:

POST domain/v1.0/module/method?order=condition&orderType=desc&offset=1800000&limit=500

Here domain, module and method are aliases for the interface's domain, module and method name. The offset and limit parameters are the paging offset and page size, which means the caller was requesting page 1800000/500 + 1 = 3601. A preliminary search of the logs turned up more than 8,000 such calls.

That was startling. Our UI pages at 25 rows per page, not 500, so this was definitely not someone manually flipping through pages on the site; someone was bulk-pulling the data (for context, our production table holds more than 100 million rows). A closer comparison of the logs showed that many of the paging requests overlapped in time, so the caller was almost certainly multi-threaded.

By analyzing the authentication token, we determined that the requests came from a client program called ApiAutotest and that the token had been generated with a QA colleague's account. I called them right away to sort things out.

Analysis

In fact, the overall efficiency of our MySQL queries is acceptable: the join queries are reasonably optimized, the selected columns are trimmed down, and the key condition and sorting fields are indexed. The problem is the page-by-page querying itself: the further back the page, the more rows are scanned and the slower it gets.
The first few pages are very fast; limit 200,25, for example, returns instantly. But the deeper the page, the slower the query, and past one million rows it becomes painfully slow. Why is that? Let's first look at what the SQL looks like when we page far into the result set:

select * from t_name where c_name1='xxx' order by c_name2 limit 2000000,25;

The slowness comes from the large offset in the limit clause. limit 2000000,25 above means the database scans 2,000,025 rows, throws away the first 2,000,000, and returns the remaining 25 to the user. That is obviously wasteful.

Chapter 6 of High Performance MySQL, "Query Performance Optimization", discusses exactly this issue:

Paging is usually implemented with limit plus an offset, together with an appropriate order by clause. But this creates a common problem: when the offset is very large, MySQL has to scan a large number of rows it does not need, only to discard them.

Data simulation

Now that we understand the cause, let's try to solve it. Since the production data is sensitive, we simulate the situation by constructing some test data.

1. Create two tables: employee table and department table

/*Department table, delete if it exists*/
drop table if EXISTS dep;
create table dep(
    id int unsigned primary key auto_increment,
    depno mediumint unsigned not null default 0,
    depname varchar(20) not null default "",
    memo varchar(200) not null default ""
);

/*Employee table, delete if exists*/
drop table if EXISTS emp;
create table emp(
    id int unsigned primary key auto_increment,
    empno mediumint unsigned not null default 0,
    empname varchar(20) not null default "",
    job varchar(9) not null default "",
    mgr mediumint unsigned not null default 0,
    hiredate datetime not null,
    sal decimal(7,2) not null,
    comn decimal(7,2) not null,
    depno mediumint unsigned not null default 0
);

2. Create two functions: generate random strings and random numbers

/* Function to generate random string */
drop FUNCTION if EXISTS rand_string;
DELIMITER $
CREATE FUNCTION rand_string(n INT) RETURNS VARCHAR(255)
BEGIN
    DECLARE chars_str VARCHAR(100) DEFAULT 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';
    DECLARE return_str VARCHAR(255) DEFAULT '';
    DECLARE i INT DEFAULT 0;
    WHILE i < n DO
    SET return_str = CONCAT(return_str,SUBSTRING(chars_str,FLOOR(1+RAND()*52),1));
    SET i = i+1;
    END WHILE;
    RETURN return_str;
END $
DELIMITER ;


/*Function to generate random department number*/
drop FUNCTION if EXISTS rand_num;
DELIMITER $
CREATE FUNCTION rand_num() RETURNS INT(5)
BEGIN
    DECLARE i INT DEFAULT 0;
    SET i = FLOOR(100+RAND()*10);
    RETURN i;
END $
DELIMITER ;

3. Write a stored procedure to generate 5 million rows of employee data

/*Create a stored procedure: insert data into the emp table*/
drop PROCEDURE if EXISTS insert_emp;
DELIMITER $
CREATE PROCEDURE insert_emp(IN START INT(10),IN max_num INT(10))
BEGIN
    DECLARE i INT DEFAULT 0;
    /*Set autocommit to 0 to turn off automatic commits until the explicit COMMIT below*/
    SET autocommit = 0;
    REPEAT
    SET i = i + 1;
    INSERT INTO emp(empno,empname,job,mgr,hiredate,sal,comn,depno) VALUES ((START+i),rand_string(6),'SALESMAN',0001,now(),2000,400,rand_num());
    UNTIL i = max_num
    END REPEAT;
    COMMIT;
END $
DELIMITER ;
/*Insert 5 million pieces of data*/
call insert_emp(0,5000000);

4. Write a stored procedure to generate 120 rows of department data

/*Create a stored procedure: insert data into the dep table*/
drop PROCEDURE if EXISTS insert_dept;
DELIMITER $
CREATE PROCEDURE insert_dept(IN START INT(10), IN max_num INT(10))
BEGIN
    DECLARE i INT DEFAULT 0;
    SET autocommit = 0;
    REPEAT
    SET i = i+1;
    INSERT INTO dep( depno,depname,memo) VALUES((START+i),rand_string(10),rand_string(8));
    UNTIL i = max_num
    END REPEAT;
    COMMIT;
END $
DELIMITER ;
/*Insert 120 records*/
call insert_dept(1,120);

5. Build indexes on the key fields. Creating the indexes after loading the data takes a while, but the data load itself runs faster this way.

/*Create index for key fields: sorting, conditions*/
CREATE INDEX idx_emp_id ON emp(id);
CREATE INDEX idx_emp_depno ON emp(depno);
CREATE INDEX idx_dep_depno ON dep(depno);
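
Before running the tests, it may be worth verifying that the indexes are in place; this is just a small sanity check, not part of the original setup:

/*List the indexes on both tables to confirm they were created*/
SHOW INDEX FROM emp;
SHOW INDEX FROM dep;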

Test

Test SQL

/*Offset is 100, take 25*/
SELECT a.empno,a.empname,a.job,a.sal,b.depno,b.depname
from emp a left join dep b on a.depno = b.depno order by a.id desc limit 100,25;
/*Offset is 4800000, take 25*/
SELECT a.empno,a.empname,a.job,a.sal,b.depno,b.depname
from emp a left join dep b on a.depno = b.depno order by a.id desc limit 4800000,25;

Execution Results

[SQL]
SELECT a.empno,a.empname,a.job,a.sal,b.depno,b.depname
from emp a left join dep b on a.depno = b.depno order by a.id desc limit 100,25;
Affected rows: 0
Time: 0.001s
[SQL]
SELECT a.empno,a.empname,a.job,a.sal,b.depno,b.depname
from emp a left join dep b on a.depno = b.depno order by a.id desc limit 4800000,25;
Affected rows: 0
Time: 12.275s

Because the deep-offset query has to scan so many rows, the two queries are clearly not in the same order of magnitude of execution time.
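
To see the cost in the query plan rather than only in the timings, EXPLAIN can be run on the slow statement against the simulated tables; a sketch (the exact plan depends on the MySQL version and table statistics), where the rows column is MySQL's estimate of how many rows it expects to examine:

/*EXPLAIN the deep-offset query; the rows estimate reflects how much data the plan expects to walk through*/
EXPLAIN SELECT a.empno,a.empname,a.job,a.sal,b.depno,b.depname
from emp a left join dep b on a.depno = b.depno order by a.id desc limit 4800000,25;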

Solution

1. Use a covering index + subquery optimization

Because the primary key id is indexed, we can first locate the id at the starting offset in the index tree, and then fetch the row data starting from that id.

/*The subquery locates the id at offset 100; the outer query then takes the 25 rows from that position*/
SELECT a.empno,a.empname,a.job,a.sal,b.depno,b.depname
from emp a left join dep b on a.depno = b.depno
where a.id >= (select id from emp order by id limit 100,1)
order by a.id limit 25;

/*The subquery locates the id at offset 4800000; the outer query then takes the 25 rows from that position*/
SELECT a.empno,a.empname,a.job,a.sal,b.depno,b.depname
from emp a left join dep b on a.depno = b.depno
where a.id >= (select id from emp order by id limit 4800000,1)
order by a.id limit 25;

Execution Results

The execution efficiency has been greatly improved compared to before:
[SQL]
SELECT a.empno,a.empname,a.job,a.sal,b.depno,b.depname
from emp a left join dep b on a.depno = b.depno
where a.id >= (select id from emp order by id limit 100,1)
order by a.id limit 25;
Affected rows: 0
Time: 0.106s

[SQL]
SELECT a.empno,a.empname,a.job,a.sal,b.depno,b.depname
from emp a left join dep b on a.depno = b.depno
where a.id >= (select id from emp order by id limit 4800000,1)
order by a.id limit 25;
Affected rows: 0
Time: 1.541s
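
To compare the plans, EXPLAIN can be run on the optimized statement as well; a sketch (exact output varies by MySQL version and statistics). The point is that the 4,800,000 skipped positions are handled inside the subquery on the id index described above, without performing the join or assembling the full rows, while the outer query only has to join and return 25 rows:

/*EXPLAIN the covering-index + subquery version of the deep page*/
EXPLAIN SELECT a.empno,a.empname,a.job,a.sal,b.depno,b.depname
from emp a left join dep b on a.depno = b.depno
where a.id >= (select id from emp order by id limit 4800000,1)
order by a.id limit 25;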

2. Redefine the starting position

Remember the primary key of the last row of the previous page, so the offset can be avoided entirely:

/*The id of the last row on the previous page was 100, so skip past 100 and scan from 101*/
SELECT a.id,a.empno,a.empname,a.job,a.sal,b.depno,b.depname
from emp a left join dep b on a.depno = b.depno
where a.id > 100 order by a.id limit 25;

/*The id of the last row on the previous page was 4800000, so skip past 4800000 and scan from 4800001*/
SELECT a.id,a.empno,a.empname,a.job,a.sal,b.depno,b.depname
from emp a left join dep b on a.depno = b.depno
where a.id > 4800000
order by a.id limit 25;

Execution Results

[SQL]
SELECT a.id,a.empno,a.empname,a.job,a.sal,b.depno,b.depname
from emp a left join dep b on a.depno = b.depno
where a.id > 100 order by a.id limit 25;
Affected rows: 0
Time: 0.001s

[SQL]
SELECT a.id,a.empno,a.empname,a.job,a.sal,b.depno,b.depname
from emp a left join dep b on a.depno = b.depno
where a.id > 4800000
order by a.id limit 25;
Affected rows: 0
Time: 0.000s

This is the most efficient approach: no matter how deep the page is, the time taken is essentially the same, because after the condition is applied only 25 rows are scanned.

But there is a catch: it only works for paging one page at a time, so that the last id of the previous page can be carried over. It breaks down when the user jumps between pages; for example, if the user finishes page 25 and jumps straight to page 35, the returned data will be wrong.
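
If page jumps must be supported, a hypothetical workaround is to fall back to the subquery approach from solution 1 for the jump itself, and then continue cursor-style from the ids it returns. Assuming 25 rows per page, jumping to page 35 means an offset of 34 * 25 = 850:

/*Hypothetical jump straight to page 35: locate the starting id via the index, then take 25 rows*/
SELECT a.id,a.empno,a.empname,a.job,a.sal,b.depno,b.depname
from emp a left join dep b on a.depno = b.depno
where a.id >= (select id from emp order by id limit 850,1)
order by a.id limit 25;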

The last-id approach is a natural fit for scenarios like Baidu search or Tencent News, where you keep scrolling down and content loads continuously; this kind of lazy loading guarantees that data is never fetched out of sequence.

3. Downgrade strategy

I saw a solution shared online by an Alibaba DBA: configure maximum values for the limit offset and for the number of rows fetched, and return an empty result if either maximum is exceeded.
His reasoning is that beyond that point you are no longer paging, you are scraping data; if you really need to find something, you should add conditions to narrow the scope instead of paging through everything.
This is roughly the same idea as my colleague's: if the offset in a request exceeds a certain value, return a 4xx error up front.

Summary

That night we applied the third solution above, capping the offset and returning an empty result when the cap was exceeded. The next day we further optimized the application code and database scripts using a combination of the first and second solutions.

In hindsight, extreme cases should be considered when building any feature, and the designed capacity should cover extreme boundary tests.

Necessary rate limiting and degradation should also be considered. For example, when a tool makes 8,000 multi-threaded calls in a short period, a counting service can detect that the caller is too frequent and cut the calls off directly.
