Understand the principles of MySQL persistence and rollback in one article

redo log

Transaction support is one of the important features that distinguishes databases from file systems. Transactions have four major characteristics:

  • Atomicity: All operations in a transaction either complete or none of them do; the transaction is indivisible.
  • Consistency: A transaction moves the database from one consistent state to another. For example, if A transfers 500 to B, A ends up with 500 less and B with 500 more, but the sum A+B never changes.
  • Isolation: Concurrent transactions are isolated from each other and do not interfere with each other.
  • Durability (persistence): Once a transaction is committed, its changes to the data are permanent.

This article focuses on durability (persistence) and, in the second half, on rollback.

When we update a record in a transaction, for example:

update user set age=11 where user_id=1;

The process is roughly as follows:

  • Check whether the page that holds the row with user_id=1 is already in memory. If not, read it from disk and load it into memory
  • Change age to 11 in memory
  • Write the redo log, which is now in the prepare state
  • Write the binlog
  • Commit the transaction, at which point the redo log changes to the commit state

This raises several questions: What is the redo log? Why do we need it? What does the prepare state of the redo log mean? Could we keep only one of the redo log and the binlog? With these questions in mind, let's take a closer look at the redo log.

Why do we need to update the memory data first instead of directly updating the disk data?

Why not update the data on disk directly on every change? Disk I/O is slow and memory is fast; their speeds differ by orders of magnitude. Indexes exist to cope with slow disk I/O: with an index we can still locate a row quickly on disk even among hundreds of millions of rows. But an index is not static and has to be maintained. When we insert a new row A that has to go in front of an existing row B, B has to be moved to make room for A, which has a cost. Worse, if the page that should receive the row is full, a new page must be allocated and some of the rows moved to it. This is called a page split, and it is even more expensive. If every SQL change modified the data on disk directly and ran into these cases, it would be very slow and, in serious cases, would time out.

This is why the update process above first loads the data page into memory and updates it there. In MySQL (InnoDB), every change is first applied to the data in the buffer pool, and the dirty pages in the buffer pool are then flushed to disk at a certain frequency (the checkpoint mechanism). The buffer pool bridges the speed gap between the CPU and the disk so that overall performance does not degrade too quickly.

Why do we need redo logs?

The buffer pool bridges the gap between the CPU and the disk, and the checkpoint mechanism ensures that data is eventually written to disk. However, a checkpoint is not triggered on every change; the master thread runs it at intervals. In the worst case the database crashes right after a change has been written to the buffer pool, and that change is lost and cannot be recovered. The D in ACID would then not be satisfied. To solve this durability problem, InnoDB transactions use WAL (Write-Ahead Logging): the log is written to disk first, and the data pages are written to disk later. A transaction is considered committed only once its log has been written successfully. The log in question is the redo log. If a crash occurs before the data has been flushed to disk, the data can be recovered from the redo log, which guarantees the D in ACID. This is the role of the redo log.
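The write-ahead idea can be sketched in a few lines of Python. This is only a toy model, not how InnoDB is implemented: the class `TinyStore`, the JSON log format, and the file name `toy_redo.log` are all made up for illustration. It shows the one property that matters here: a commit returns only after the log record has been forced to disk, so the in-memory state can be rebuilt from the log after a crash.

```python
import json
import os
import tempfile


class TinyStore:
    """Toy key-value store: memory first, log forced to disk before commit returns."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.buffer_pool = {}  # volatile: lost when the process crashes

    def commit(self, key, value):
        # Change the in-memory copy only; no data file is touched here.
        self.buffer_pool[key] = value
        # Append the change to the log and force it to stable storage.
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        # Only now is the commit reported as successful.
        return True

    def recover(self):
        # Rebuild the lost in-memory state by replaying the log in order.
        self.buffer_pool = {}
        if os.path.exists(self.log_path):
            with open(self.log_path, encoding="utf-8") as log:
                for line in log:
                    record = json.loads(line)
                    self.buffer_pool[record["key"]] = record["value"]
        return self.buffer_pool


def crash_and_recover():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "toy_redo.log")
        TinyStore(path).commit("user:1:age", 11)
        # Simulated crash: a new instance starts with an empty buffer pool.
        return TinyStore(path).recover()
```

Calling `crash_and_recover()` commits one change, throws the first instance away, and rebuilds the state from the log alone.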

How is redo log implemented?

The redo log is not written straight to disk either. It has its own buffer, the redo log buffer. InnoDB first writes redo records to the redo log buffer and then flushes the buffer to the actual redo log file at a certain frequency. The redo log buffer does not usually need to be large, since it is only a temporary container: the master thread flushes it to the redo log file every second, so it only has to hold the changes that transactions produce within about one second. Taking MySQL 5.7.23 as an example, the default is 16 MB.

mysql> show variables like '%innodb_log_buffer_size%';
+------------------------+----------+
| Variable_name          | Value    |
+------------------------+----------+
| innodb_log_buffer_size | 16777216 |
+------------------------+----------+

A 16 MB buffer is sufficient for most applications. The buffer is flushed to the redo log in the following situations:

  • The master thread flushes the buffer to the redo log every second
  • When each transaction is committed, the buffer is flushed to the redo log.
  • When the remaining space in the buffer is less than 1/2, it will be flushed to the redo log

Note that flushing the redo log buffer to the redo log does not necessarily reach the disk; the write may only reach the OS cache. This is an optimization that modern operating systems apply to make file writes more efficient, and the operating system decides when the data is actually written out (for example, depending on how full the OS cache is). This is a problem for InnoDB: if fsync is left to the operating system and the system crashes, the data is lost (although a crash of the whole system is relatively unlikely). For this reason InnoDB provides the innodb_flush_log_at_trx_commit setting, which lets users choose the strategy.

mysql> show variables like 'innodb_flush_log_at_trx_commit';
+--------------------------------+-------+
| Variable_name                  | Value |
+--------------------------------+-------+
| innodb_flush_log_at_trx_commit | 1     |
+--------------------------------+-------+

The possible values are:

  • 0: Nothing is written at commit. The master thread writes the redo log buffer to the redo log and calls fsync about once per second.
  • 1: Default value. The redo log is written and fsync is called synchronously on every transaction commit.
  • 2: The redo log is written to the OS cache on every commit, but fsync is deferred and performed about once per second.

Comparing the three flush strategies:

0 is the most efficient, but it is also the least safe: if the mysqld process crashes, the transactions of roughly the last second, which are still in the redo log buffer, are lost, so the D in ACID cannot be guaranteed.

2 is a compromise. Its I/O efficiency is theoretically higher than 1 and lower than 0, and its data safety is theoretically lower than 1 and higher than 0. A crash of the mysqld process alone loses nothing, because the log is already in the OS cache, but if the operating system crashes or the machine loses power, the data in the OS cache is lost. This strategy therefore still carries a risk of data loss and cannot guarantee D.

1 is the default value. It guarantees D, because the redo log of a committed transaction has already been synced to disk, but it is the least efficient. I personally recommend using the default value. Although the probability of an operating system crash is theoretically lower than the probability of a database crash, since transactions are used, data safety should be relatively more important.
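To make the difference between the three values concrete, here is a small toy model in Python. It assumes the behaviour documented in the MySQL reference manual (0: write and flush about once per second; 1: write and flush at every commit; 2: write at every commit, flush about once per second). The class `RedoLogModel` and the helper `after_one_commit` are invented for this sketch and track only where a commit's log record currently lives.

```python
from dataclasses import dataclass, field


@dataclass
class RedoLogModel:
    policy: int                                     # innodb_flush_log_at_trx_commit
    log_buffer: list = field(default_factory=list)  # inside the mysqld process
    os_cache: list = field(default_factory=list)    # written but not yet fsynced
    disk: list = field(default_factory=list)        # durable

    def commit(self, trx):
        self.log_buffer.append(trx)
        if self.policy in (1, 2):
            self._write()   # log buffer -> OS cache
        if self.policy == 1:
            self._fsync()   # OS cache -> disk

    def every_second(self):
        # Background flush; with 0 and 2 this is what makes commits durable.
        self._write()
        self._fsync()

    def _write(self):
        self.os_cache.extend(self.log_buffer)
        self.log_buffer.clear()

    def _fsync(self):
        self.disk.extend(self.os_cache)
        self.os_cache.clear()

    def survivors(self, crash):
        # "mysqld": only the process dies; the OS still writes out its cache.
        # "os": the whole machine dies; only fsynced records remain.
        if crash == "mysqld":
            return self.disk + self.os_cache
        return list(self.disk)


def after_one_commit(policy, crash):
    """Commit one transaction, crash before the next background flush."""
    model = RedoLogModel(policy)
    model.commit("t1")
    return model.survivors(crash)
```

For example, `after_one_commit(2, "mysqld")` still contains the transaction, while `after_one_commit(2, "os")` does not.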

The redo log is a physical log: each record describes a modification to a page, in the form "at offset x of page y, write value z". For example:

page(2,4),offset 64,value 2

In the InnoDB engine, the redo log is stored in units of 512 bytes, each called a redo log block. If the redo records generated for a page exceed what one block can hold, they are logically split across multiple blocks.

A redo log block consists of a log header, a log body, and a log tail. The log header occupies 12 bytes and the log tail occupies 8 bytes, so the data that can actually be stored in a block is 512-12-8=492 bytes.
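As a quick check of this arithmetic, the sketch below computes how many blocks a given amount of redo data occupies. It simply takes the header and tail sizes stated above as given; the function name `blocks_needed` is made up for the example.

```python
BLOCK_SIZE = 512   # bytes per redo log block
HEADER_SIZE = 12   # log header size as stated in the text
TAIL_SIZE = 8      # log tail size as stated in the text
PAYLOAD_SIZE = BLOCK_SIZE - HEADER_SIZE - TAIL_SIZE


def blocks_needed(log_bytes, payload=PAYLOAD_SIZE):
    """Number of blocks required to hold log_bytes of redo data."""
    if log_bytes <= 0:
        return 0
    # Ceiling division without floating point.
    return -(-log_bytes // payload)
```

With these figures, 492 bytes of redo data fit exactly into one block, and one additional byte spills into a second block.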

Multiple redo log blocks make up our redo log.

Each redo log file is 48 MB by default:

mysql> show variables like 'innodb_log_file_size';
+----------------------+----------+
| Variable_name        | Value    |
+----------------------+----------+
| innodb_log_file_size | 50331648 |
+----------------------+----------+

By default, InnoDB uses two redo log files to form a log group, and it is this log group that does the actual work.

mysql> show variables like 'innodb_log_files_in_group';
+---------------------------+-------+
| Variable_name             | Value |
+---------------------------+-------+
| innodb_log_files_in_group | 2     |
+---------------------------+-------+

The two files in the group are named ib_logfile0 and ib_logfile1.

When ib_logfile0 is full, writing continues in ib_logfile1. When ib_logfile1 is full, writing wraps around to ib_logfile0 again, and the cycle continues.
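The circular behaviour can be pictured as simple modular arithmetic. The sketch below maps an ever-growing logical write offset to a file and a position inside it. It is a simplification that ignores the header area real redo log files reserve, and the function `locate` is invented for illustration.

```python
FILE_SIZE = 48 * 1024 * 1024  # default innodb_log_file_size shown above
FILES_IN_GROUP = 2            # default innodb_log_files_in_group shown above


def locate(offset, file_size=FILE_SIZE, files=FILES_IN_GROUP):
    """Map a logical write offset to (file name, offset inside that file)."""
    position = offset % (file_size * files)   # wrap around the whole group
    index, in_file = divmod(position, file_size)
    return f"ib_logfile{index}", in_file
```

Once the offset passes the combined size of both files, `locate` lands in ib_logfile0 again, which is where old records start being overwritten.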

Why is a block designed to be 512 bytes?

This is related to disk sectors. The default sector size of a mechanical disk is 512 bytes. If a write is larger than 512 bytes, it spans more than one sector, and the disk has to keep rotating to reach the next sector. Suppose two sectors A and B must be written: if sector A is written successfully and sector B fails, the write is not atomic. If each write is exactly 512 bytes, the same size as a sector, every write is atomic.

Why two-phase commit?

From the above we know that committing a transaction first writes the redo log (prepare), then writes the binlog, and finally commits (commit). Why is the prepare step needed? Why not commit the redo log directly? Suppose the redo log were committed directly and a crash occurred while the binlog was being written. The binlog would then lack the corresponding data, and so would every slave that relies on the binlog to replicate data, leaving the master and the slaves inconsistent. Two-phase commit (2PC) is therefore needed to keep the redo log and the binlog consistent. The specific steps are: the redo log record in the prepare state records the XID of the 2PC transaction; the binlog also records this XID once it has been written; finally, the commit mark is written to the redo log.

Can we keep only one of the redo log and the binlog?

No. The redo log has a fixed size; when it is full, writing starts over from the beginning and overwrites old records. Because the redo log cannot keep all changes, it cannot be used to synchronize data to the slaves in master-slave mode, so the binlog is definitely needed. The binlog is generated by the server layer of MySQL and is independent of the storage engine. It is also called the archive log: when one binlog file is full, writing continues in a new binlog file.

So could we keep only the binlog and drop the redo log? No again. The role of the redo log is to provide crash safety. A data modification first changes the data page in the buffer pool; at that point the change has not been written to disk, mainly because random disk reads and writes are inefficient. The master thread writes the data to disk periodically, which has the advantage that many modifications can be written in one go. The problem is that after a transaction commits, its data may still sit in a dirty page in the buffer pool. If the database crashes before that page is flushed, the committed data cannot be recovered after the restart, and the D in ACID is violated. This is where the redo log comes in. A transaction commit must ensure that the redo log is written successfully; only then is the transaction considered committed. The redo log is mostly written to disk sequentially, so writing it is much more efficient. When a crash occurs after commit, the data can be recovered from the redo log, which is why the redo log is needed.

However, a transaction commit also requires the binlog to be written successfully, so why can't the data that has not reached the disk be recovered from the binlog? Because the binlog does not know which data has already been written to disk, it does not know which data needs to be recovered. With the redo log, once the corresponding data has been written to disk, the matching redo records are no longer needed and their space can be overwritten. After the database restarts, only the records still remaining in the redo log need to be replayed.

How to recover after a crash?

Thanks to two-phase commit, the redo log is marked with prepare or commit at each stage, and the transaction XID is recorded in both the redo log and the binlog. With this information, the database checks all transactions in the redo log when it restarts. If a transaction in the redo log is in the commit state, the crash happened after the commit, and the data only needs to be recovered from the redo log. If the transaction is in the prepare state, the crash happened before the commit, and the binlog decides the outcome. If the binlog contains the corresponding XID, the binlog was written successfully but the commit did not complete in time, so the commit is simply executed again. If the XID cannot be found in the binlog, the crash happened before the binlog was written successfully, and the transaction is rolled back.
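The decision described above fits into one small function. The sketch below is only a restatement of the rules in Python; the function name `recovery_action`, the state strings, and the returned labels are chosen for this example and are not MySQL identifiers.

```python
def recovery_action(redo_state, xid, binlog_xids):
    """Decide what to do with one transaction found in the redo log after a crash.

    redo_state  -- "prepare" or "commit", the mark found in the redo log
    xid         -- the transaction's XID
    binlog_xids -- set of XIDs that were completely written to the binlog
    """
    if redo_state == "commit":
        # Crash happened after commit: just replay the redo log.
        return "redo"
    if redo_state == "prepare":
        # Crash happened before commit: the binlog decides.
        return "commit" if xid in binlog_xids else "rollback"
    raise ValueError(f"unknown redo log state: {redo_state!r}")
```

A prepared transaction whose XID is present in the binlog is committed, so that the master ends up with the same data the slaves will replay from the binlog.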

undo log

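The redo log guarantees transaction durability, and the undo log guarantees transaction atomicity. Before a transaction updates data, it first writes the old version to an undo log, so the update process from the beginning of this article becomes roughly:

  • Load the page that holds the row into memory if it is not already there
  • Write the undo log, recording the row as it was before the change
  • Modify the data in memory
  • Write the redo log (prepare state)
  • Write the binlog
  • Commit the transaction; the redo log changes to the commit state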

Under what circumstances will an undo log be generated?

The undo log serves two purposes: MVCC (multi-version concurrency control) and rollback. Here we mainly talk about rollback. When we insert, update, or delete rows in a transaction, a corresponding undo log is generated. When we roll back, the undo log lets us return to the state at the beginning of the transaction. Note that a rollback does not restore the physical page to its earlier state; it logically undoes the transaction's own changes. For example, suppose a transaction changes a value from A to B, and another transaction has meanwhile made a change C on the same page. If the rollback simply restored the whole data page to its earlier state, C would be overwritten.

For the InnoDB engine, each row record has several hidden columns in addition to the data of the record itself:

  • DB_ROW_ID : If there is no primary key explicitly defined for the table and there is no unique index defined in the table, InnoDB will automatically add a hidden column row_id as the primary key to the table.
  • DB_TRX_ID : Each transaction is assigned a transaction ID. When a record is changed, the transaction ID of this transaction is written into trx_id.
  • DB_ROLL_PTR : Rollback pointer, essentially a pointer to the undo log.

When we do an INSERT:

begin;
INSERT INTO user (name) VALUES ('tom');

Each inserted row generates an insert undo log, and the row's rollback pointer points to it. The undo log records its own sequence number (undo no), the primary key columns and their values, and so on, so that a rollback can delete the row directly by primary key.

An update generates an update undo log. There are two cases: updates that change the primary key and updates that do not. Suppose the following statement is executed:

UPDATE user SET name='Sun' WHERE id=1;

The old version of the record is written to a new undo log with undo no=1. The row's rollback pointer now points to this new undo log, which in turn points to the previous undo log (undo no=0).

Suppose now you execute:

UPDATE user SET id=2 WHERE id=1; 

When the primary key is updated, the delete mark of the original row is set first. The row is not actually removed at this point; the purge thread decides later when to remove it for real. A new row is then inserted, which generates its own undo log, and the undo sequence number keeps increasing.

As we can see, every change to the data generates an undo log, and a record that is changed several times generates several undo logs. Each undo log records the state before the change, and the sequence numbers increase monotonically. To roll back, we follow the sequence numbers backwards until we reach the original data.

How is undo log rolled back?

Continuing the example above, if a rollback is executed, the process is as follows:

  • Use the undo log with undo no=3 to delete the row with id=2
  • Use the undo log with undo no=2 to reset the delete mark of the row with id=1 to 0
  • Use the undo log with undo no=1 to restore the name of the row with id=1 to tom
  • Use the undo log with undo no=0 to delete the row with id=1
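The four steps can be replayed with a toy model in Python. It is a sketch under simplifying assumptions: the table is a plain dictionary, the undo log is a list whose index plays the role of undo no, the inserted row is assumed to receive id=1, and the names `run_transaction` and `rollback` are invented for the example.

```python
def run_transaction():
    table = {}  # primary key -> {"name": ..., "delete_mark": bool}
    undo = []   # undo records in write order; list index == undo no

    # INSERT INTO user (name) VALUES ('tom');  (assumed to get id = 1)
    table[1] = {"name": "tom", "delete_mark": False}
    undo.append(("insert", 1))                              # undo no = 0

    # UPDATE user SET name='Sun' WHERE id=1;
    undo.append(("update", 1, {"name": table[1]["name"]}))  # undo no = 1
    table[1]["name"] = "Sun"

    # UPDATE user SET id=2 WHERE id=1;  (delete-mark old row, insert new row)
    undo.append(("delete_mark", 1))                         # undo no = 2
    table[1]["delete_mark"] = True
    table[2] = {"name": "Sun", "delete_mark": False}
    undo.append(("insert", 2))                              # undo no = 3
    return table, undo


def rollback(table, undo):
    # Apply the undo records from the newest to the oldest.
    while undo:
        record = undo.pop()
        kind, key = record[0], record[1]
        if kind == "insert":
            del table[key]                       # undo an insert: delete by key
        elif kind == "update":
            table[key].update(record[2])         # restore the old column values
        elif kind == "delete_mark":
            table[key]["delete_mark"] = False    # clear the delete mark
    return table
```

Running `rollback(*run_transaction())` walks the undo records in the order 3, 2, 1, 0 and leaves the table empty, as it was before the transaction began.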

Where is the undo log stored?

InnoDB manages undo logs in segments, called rollback segments. Each rollback segment holds 1024 undo log segments, and InnoDB supports 128 rollback segments by default.

mysql> show variables like 'innodb_undo_logs';
+------------------+-------+
| Variable_name    | Value |
+------------------+-------+
| innodb_undo_logs | 128   |
+------------------+-------+

In theory, the maximum number of concurrent transactions that can be supported is therefore 128*1024 = 131072. Each rollback segment can be thought of as an array of 1024 elements.
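The capacity figure and the "array of slots" picture can be sketched as follows. This is a deliberately naive model: the linear scan in `claim_free_slot` is an assumption made for illustration and is not claimed to be how InnoDB actually picks a rollback segment; all names are made up.

```python
ROLLBACK_SEGMENTS = 128   # value of innodb_undo_logs shown above
SLOTS_PER_SEGMENT = 1024  # undo log segments per rollback segment


def max_concurrent_transactions(segments=ROLLBACK_SEGMENTS, slots=SLOTS_PER_SEGMENT):
    return segments * slots


def claim_free_slot(segments, trx_id):
    """Claim the first free slot; return (segment no, slot no) or None if all are taken."""
    for seg_no, slots in enumerate(segments):
        for slot_no, owner in enumerate(slots):
            if owner is None:
                slots[slot_no] = trx_id
                return seg_no, slot_no
    return None
```

A transaction that needs to write undo log has to obtain such a slot first; when every slot is taken, `claim_free_slot` returns `None`, which corresponds to hitting the concurrency limit.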

When a transaction starts and needs to write undo log, it first has to find a free slot in a rollback segment. Once a free slot is found, it allocates an undo page and writes the undo log into that page. The default page size in MySQL (InnoDB) is 16 KB:

mysql> show variables like '%innodb_page_size%';
+------------------+-------+
| Variable_name    | Value |
+------------------+-------+
| innodb_page_size | 16384 |
+------------------+-------+

Allocating a whole page per transaction is therefore quite wasteful (unless the transaction is very long). Suppose the application runs at 1000 TPS: that is 1000 pages per second, or roughly 16 MB of storage per second and about 1 GB per minute. At this rate, unless MySQL cleaned up very aggressively, disk usage would grow very quickly and a lot of space would be wasted. Undo pages are therefore designed to be reused. When a transaction commits, its undo page is not deleted immediately. Because of reuse, an undo page may contain undo logs of several transactions. After a transaction commits, its undo log is placed on a linked list, and InnoDB checks whether less than 3/4 of the undo page is in use. If so, the page can be reused: it is not reclaimed, and the undo logs of other transactions can be appended after the existing ones. Since undo logs are scattered in this way, reclaiming the corresponding disk space is not very efficient.
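The storage estimate in this paragraph is easy to reproduce. The sketch below assumes, as the text does, one freshly allocated 16 KB undo page per transaction and no reuse; the function `undo_bytes` is made up for the example.

```python
PAGE_SIZE = 16 * 1024  # innodb_page_size shown above, in bytes
MIB = 1024 * 1024


def undo_bytes(tps, seconds, page_size=PAGE_SIZE):
    """Bytes of undo pages allocated if every transaction takes a fresh page."""
    return tps * seconds * page_size


per_second_mib = undo_bytes(1000, 1) / MIB    # 15.625 MiB, i.e. roughly 16 MB
per_minute_mib = undo_bytes(1000, 60) / MIB   # 937.5 MiB, i.e. roughly 1 GB
```

The exact figures are slightly below the rounded 16 MB and 1 GB quoted above, but the order of magnitude is what motivates reusing undo pages.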
