Solution to the long delay of MySQL database master-slave replication

Solution to the long delay of MySQL database master-slave replication

Preface

The delay of MySQL master-slave replication has been a long-standing problem in the industry. The occurrence of delays will reduce the value of master-slave read-write separation, which is not conducive to the use of MySQL in businesses with high data real-time requirements.

UDB is a cloud database service launched by UCloud. It has been online for six years and operates tens of thousands of UDB MySQL instances. In addition to providing high availability, high performance, and easy-to-use product features, the team also helps users solve 2-3 MySQL instance master-slave replication delay issues on average every day. From a lot of practice, we have summarized various causes and solutions for master-slave replication delays, and now share them here.

The importance of latency issues

The master-slave replication mechanism is widely used in the internal implementation of UDB: the slave and master databases created by UDB adopt the "master-slave replication" data replication; in addition, UDB's flagship product "UDB MySQL High Availability Instance" also adopts the "dual-master mode" in which two databases are mutually master-slave to replicate data, and the core of the dual-master mode is the master-slave replication mechanism.

If there is a delay between master-slave replication, the consistency of master-slave data will be affected.

In the high-availability replication scenario, we have taken into account in the UDB high-availability disaster recovery design that if the primary and standby data are inconsistent, high-availability disaster recovery switching is not allowed by default. Because when the primary and standby data are inconsistent, disaster recovery switching occurs and data is written to the new primary database, unexpected serious consequences will occur from a business perspective.

The replication delay problem not only brings adverse consequences in UDB high availability, but also in the scenario of read-only slave databases, if the slave database has replication delay, it may also have a certain impact on the business. For example, it manifests itself in inconsistent reading and writing in the business - the newly added/modified data cannot be found, etc.

This shows that the delay problem of master-slave replication requires special attention in database operations. Typically, the DBA executes 'SHOW SLAVE STATUS' on the library and observes

The value of 'Seconds_Behind_Master' can help you understand the data replication delay between a current database and its master database. This value is so important that we extracted it separately on the UDB monitoring interface and designed the "slave synchronization delay" monitoring item so that operation and maintenance personnel can observe it directly on the console.

Analysis and solution of delay problem in production environment

We have summarized the most common master-slave replication delay cases into several categories. The following is a summary of the phenomenon description, cause analysis, and solutions for the relevant cases.

◆ Case 1: Frequent DML requests on the main database

During business peak periods, some users may experience master-slave replication delays, especially when there are a large number of write request operations on the database master, that is, a large number of concurrent operations such as insert, delete, and update.

Description of the phenomenon

By observing the QPS value of the master database's write operations, we can see that the QPS value of the master database's write operations suddenly increases, accompanied by an increase in the master-slave replication delay. This can be judged to be caused by frequent DML requests to the master database.

As shown in the figure above, the QPS suddenly increased at around 17:58, and the write-related QPS on the console also increased accordingly. When QPS increases suddenly, the corresponding latency also gradually increases, as shown in the following figure.

Cause Analysis

After analysis, we believe that this is due to a large number of write request operations on the main database, which generated a large amount of binlog in a short period of time. These operations need to be synchronized to the slave database and executed, resulting in a delay in data replication between the master and slave.

A deeper analysis of the cause shows that during the peak business period, the master database writes data concurrently, while the slave database SQL Thread plays back the binlog log in a single thread, which can easily cause relaylog accumulation and delay.

Solution

If the version is MySQL 5.7 or earlier, you can use sharding to break up the write requests by scaling out horizontally, thus increasing the parallelism of write requests to the binlog.

If it is MySQL 5.7 or later, in MySQL 5.7, parallel replication based on logical clock (Group Commit) is used. In MySQL 8.0, parallel replication based on Write Set is used. Both solutions can improve the performance of binlog playback and reduce latency.

◆ Case 2: The main database executes a large transaction

A large transaction refers to the execution of a transaction that takes a very long time. Common statements that generate large transactions are:

A large number of slow data import statements are used, such as INSERT INTO $tb, SELECT * FROM $tb, LOAD DATA INFILE, etc.
Use UPDATE and DELETE statements to perform UPDATE and DELETE on a large table.
When this transaction is replayed on the slave database, a master-slave replication delay may occur.

Description of the phenomenon

We analyze the results of SHOW SLAVE STATUS and find that the Exec_Master_Log_Pos field has not changed, and the second_behinds_master continues to increase, while the value of the Slave_SQL_Running_State field is "Reading event from the relay log". At the same time, by analyzing the master database binlog and looking at the transactions currently executed by the master database, we can find some large transactions. This basically determines that the master-slave replication delay is caused by the execution of large transactions.

Cause Analysis

After a large transaction is recorded in the binlog and synchronized to the slave database, it takes a very long time for the slave database to execute the transaction. During this period, master-slave replication delay will occur.

For example, if the master database takes 200 seconds to update a large table, and the master and slave databases have similar configurations, the slave database will also need to spend almost the same amount of time to update the large table. At this time, slave database delays begin to accumulate, and subsequent events cannot be updated.

Solution

Our improvement method for the master-slave replication delay caused by this situation is to split the large transaction statement into several small transactions, so that they can be committed in time and reduce the master-slave replication delay.

◆ Case 3: The main database executes DDL statements on a large table

DDL stands for Data Definition Language, which refers to some statements that modify the table structure, such as adding a field or an index to the table. When DDL statements are executed on a large table in the master database, master-slave replication delay may occur.

Description of the phenomenon

From the phenomenon, if the output of SHOW SLAVE STATUS executed by the slave library shows that Exec_Master_Log_Pos has not changed, and the master library is not executing large transactions, then it is possible that the DDL of a large table is being executed. This can be confirmed by analyzing the main database binlog and looking at the transactions currently executed by the main database.

The execution of DDL statements can be further broken down into the following:

1. DDL is not started and is blocked. In this case, the result of SHOW SLAVE STATUS shows that Slave_SQL_Running_State is waiting for table metadata lock, and Exec_Master_Log_Pos remains unchanged.

2. The DDL is being executed and SQL Thread single-threaded application causes increased latency. In this case, by observing the results of SHOW SLAVE STATU, you can find that Slave_SQL_Running_State is altering table, while Exec_Master_Log_Pos remains unchanged.

If the above phenomenon occurs, it is very likely that the master database executes DDL statements on the large table, synchronizes them to the slave database, and replays them in the slave database, resulting in master-slave replication delay.

Cause Analysis

The reason for the master-slave replication delay caused by DDL is similar to that of large transactions. It is also because the slave library executes the binlog of DDL slowly, which causes the master-slave replication delay.

Solution

In this case, we mainly use SHOW PROCESSLIST or query information_schema.innodb_trx to find the blocking DDL statement and KILL the related queries to allow DDL to execute normally in the slave database.

The delay caused by DDL itself is difficult to avoid. It is recommended to consider:

Avoid business peaks and try to schedule execution during business off-peak periods;

After setting sql_log_bin=0, manually execute DDL on the master and slave databases respectively (this operation may cause data inconsistency for some DDL operations, so be sure to test it strictly). If the user uses the cloud database UDB, you can contact the UCloud UDB operation and maintenance team for assistance.

◆ Case 4: Configuration inconsistency between the master and slave databases

If the master and slave libraries use different computing and storage resources, or different kernel tuning parameters, the master and slave may be inconsistent.

Description of the phenomenon

We will compare the performance monitoring data of the master and slave databases in detail. If we find that the monitoring data is very different, we can make a clear judgment by checking the various configurations of the master and slave databases.

Cause Analysis

Differences in the configuration of various hardware or resources may lead to performance differences between the master and slave, resulting in master-slave replication delays:

Hardware: For example, if the master database instance server uses SSD disks, while the slave database instance server uses ordinary SAS disks, the write operations generated by the master database cannot be digested immediately on the slave database, resulting in master-slave replication delay;
Configuration: For example, inconsistent RAID card write strategies, inconsistent OS kernel parameter settings, inconsistent MySQL disk placement strategies, etc. are all possible reasons.

Solution

Consider unifying the configuration of DB machines (including hardware and option parameters) as much as possible. Even for some OLAP businesses, the hardware configuration of the slave database instance needs to be slightly higher than that of the master database.

◆ Case 5: The table lacks a primary key or a suitable index

If the database table lacks a primary key or a suitable index, master-slave replication delay may occur when the binlog_format of the master-slave replication is set to 'row'.

Description of the phenomenon

When we check the database, we find:

Observe the output of SHOW SLAVE STATUS and find that Slave_SQL_Running_State is Reading event from the relay log;

SHOW OPEN TABLES WHERE The table with in_use=1 always exists;

Observe that the Exec_Master_Log_Pos field of SHOW SLAVE STATUS remains unchanged;

The CPU usage of the mysqld process is close to 100% (when there is no read service), and the IO pressure is not great.

When these phenomena occur, it can be considered that there is a high probability that the table lacks a primary key or unique index.

Cause Analysis

When the binlog_format of the master-slave replication is set to 'row', for example, there is a scenario where the master database updates 200,000 rows of data in a table of 5 million. In row format, binlog records 200,000 update operations, which means that each operation updates one record. If this statement happens to have a bad execution plan, such as a full table scan, then each update statement requires a full table scan. At this time, SQL Thread replay will be extremely slow, causing serious master-slave replication delays.

Solution

In this case, we will check the table structure, ensure that each table has an explicit auto-increment primary key, and assist users in creating appropriate indexes.

◆ Case 6: The slave’s own pressure is too high

Sometimes, when the performance pressure of the slave database is very high, it cannot keep up with the update speed of the master database, resulting in master-slave replication delay.

Description of the phenomenon

When observing the database instance, you may find that the CPU load is too high and the IO utilization is too high, which causes the SQL Thread application to be too slow. In this way, it can be determined that the delay in master-slave replication is caused by excessive pressure on the slave library itself.

Cause Analysis

Some UCloud users use the read-write separation mode for the master and slave databases, and most read requests are executed on the slave database. In a scenario where the business has a large number of read requests, the slave database will generate much greater performance pressure than the master database. Some users even run OLAP businesses that consume a lot of computing resources on the slave database, which also poses higher performance challenges to the slave database, which will cause delays in master-slave replication.

Solution

In this case, we recommend that users create more slaves to distribute read requests and reduce the pressure on existing slave instances. For OLAP business, you can set up a slave database specifically for OLAP business, and allow appropriate master-slave replication delay for this slave database.

Summarize

When using MySQL's master-slave replication mode for data replication, the master-slave replication delay is a key factor that needs to be considered. It will affect the consistency of the data and thus affect the high availability disaster recovery switching of the database.

When encountering a master-slave replication delay between databases, our team has summarized the following methods and processes based on past experience to help troubleshoot the problem:

Use SHOW SLAVE STATUS and SHOW PROCESSLIST to view the current status of the slave library. (By the way, similar reasons can also be ruled out when backing up from the library);

If Exec_Master_Log_Pos does not change, consider large transactions, DDL, and no primary key, and check the binlog and position corresponding to the master database;

If Exec_Master_Log_Pos changes and the delay gradually increases, consider the load on the slave machine, such as IO, CPU, etc., and consider whether the pressure on the master write operation and the slave itself is too great.

This article comes from: UCloud Technology, and is shared by UCloud senior experts Ding Shun & Zhang Suning.

Well, that’s all for this article. I hope the content of this article will be of certain reference value to your study or work. Thank you for your support of 123WORDPRESS.COM.

You may also be interested in:
  • MySQL master-slave synchronization mechanism and synchronization delay problem tracking process
  • MySQL database backup setting delayed backup method (MySQL master-slave configuration)
  • Mysql master-slave delay diagram method

<<:  Vue Element-ui form validation rule implementation

>>:  Graphic tutorial on configuring nginx file server in windows 10 system

Recommend

Example analysis of mysql user rights management

This article describes the MySQL user rights mana...

MySQL Series 3 Basics

Table of contents Tutorial Series 1. Introduction...

Method of Vue component document generation tool library

Table of contents Parsing .vue files Extract docu...

How to view the IP address of Linux in VMware virtual machine

1. First, double-click the vmware icon on the com...

Detailed explanation of firewall rule settings and commands (whitelist settings)

1. Set firewall rules Example 1: Expose port 8080...

How to design and create adaptive web pages

With the popularization of 3G, more and more peop...

Brief analysis of MySQL union and union all

In the database, both UNION and UNION ALL keyword...

How to view version information in Linux

How to view version information under Linux, incl...

JavaScript to implement the aircraft war game

This article shares with you how to use canvas an...

How to process blob data in MySQL

The specific code is as follows: package epoint.m...

Vue close browser logout implementation example

Table of contents 1. beforeunload event 2. Unload...

jQuery achieves full screen scrolling effect

This article example shares the specific code of ...

How to run postgreSQL with docker

1. Install Docker. Reference URL: Docker Getting ...

Solution to Linux CentOS 6.5 ifconfig cannot query IP

Recently, some friends said that after installing...

Summary of front-end knowledge in the Gokudō game

background In the early stages of learning Japane...