Disadvantages and reasonable use of MySQL database index

A good index is particularly important to a database system; indexes could be called the heart of a database. Without them, every query degenerates into a scan of the raw data, and the database is little better than an ordinary file. Today let's talk about MySQL indexes: the benefits of B+ tree indexes in MySQL from a detailed, practical business perspective, and the points we need to watch when using them.

Proper use of indexes

At work, the most direct way to decide whether a column in a table needs an index is to check whether it often appears in our WHERE conditions. At a macro level there is nothing wrong with this, but in the long run a more detailed analysis is sometimes needed: do we need more than a single-column index on this field? Would a joint index over several fields be better? Take a user table as an example: its fields might include the user's name, ID number, home address, and so on.

1. Disadvantages of ordinary indexes

Now there is a requirement to find a user's name based on the user's ID number. The first solution that comes to mind is to create an index on id_card ; strictly speaking a unique index, because the ID number must be unique. So we execute the following query:

SELECT name FROM user WHERE id_card=xxx

Its process should be like this:

  • First search on the id_card index tree to find the primary key id corresponding to id_card
  • Search the primary key index by id and find the corresponding name

The result is correct, but in terms of efficiency this query is a bit expensive, because it traverses two B+ trees: the id_card index first, then the primary key index. Assuming each tree has a height of 3, the combined height is 6. Since both root nodes sit in memory, 4 of the 6 page reads go to disk. At an average of 10ms per random disk IO, the lookup takes about 40ms. That is average at best, not fast.
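As a back-of-envelope check, the arithmetic above can be sketched in Go. The 10ms random-IO figure and the in-memory root pages are the text's assumptions, not measurements:

```go
package main

import "fmt"

// diskIOCost estimates disk IOs for a lookup that traverses B+ trees of
// the given heights, assuming each tree's root page is already in memory.
// Back-of-envelope arithmetic only, using the text's 10ms-per-IO figure.
func diskIOCost(treeHeights []int, ioMillis int) (ios, totalMillis int) {
	for _, h := range treeHeights {
		ios += h - 1 // root level is free; the rest hit disk
	}
	return ios, ios * ioMillis
}

func main() {
	// id_card index lookup plus table return: two trees of height 3.
	ios, ms := diskIOCost([]int{3, 3}, 10)
	fmt.Printf("%d disk IOs, ~%dms\n", ios, ms) // 4 disk IOs, ~40ms
}
```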

2. The pitfalls of primary key index

Since the problem is the table return, which forces us to search two trees, the core question is whether we can search only one. From a business perspective you may spot a way in: the ID number is unique, so can we use it as the primary key instead of the default auto-increment ID? Then the whole table would need only one index, and all the required data, including the name, could be found through the ID number. It seems to make sense at first glance: just set the primary key to the ID number on every insert. But if we think about it carefully, there is a problem.

Here we need to talk about the characteristics of the B+ tree. Its data is stored on the leaf nodes, and the data is managed in pages of 16KB each. Even a single row of data occupies a 16KB page, and only when a data page is full is new data written to a fresh page. The new page is not necessarily physically adjacent to the old one, and this is the critical point: although the pages may be physically discontinuous, they are linked together so that the data remains logically continuous.

You may be curious what this has to do with using the ID number as the primary key. Pay attention to the keyword "continuous": ID numbers are not continuous. When we insert an out-of-order key, data must move to keep the page sorted. For example, if a page holds 1->5 and we insert 3, then 5 has to move behind 3. You may say that costs little. But if inserting 3 fills page A, we have to check whether the following page B has room; if it does, the entry that overflows from page A becomes the first entry of page B, and B's existing data has to move accordingly.

If page B does not have enough space either, a new page C must be allocated and part of the data moved into it. The direct link between page A and page B is cut, and page C is spliced in between the two. From a code perspective, this just means updating the linked-list pointers.
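The page movement described above can be illustrated with a deliberately tiny model: sorted pages holding at most 4 keys, where each insert may shift existing keys or split a full page. This is a toy sketch, not InnoDB's actual algorithm:

```go
package main

import (
	"fmt"
	"sort"
)

const pageSize = 4 // toy page capacity (a real InnoDB page is 16KB)

type stats struct{ shifts, splits int }

// insert adds key to the sorted, paged key space, counting how many
// existing keys must shift right and how many page splits occur.
func insert(pages [][]int, key int, st *stats) [][]int {
	if len(pages) == 0 {
		return [][]int{{key}}
	}
	// locate the first page whose last key is >= key
	i := sort.Search(len(pages), func(i int) bool {
		return key <= pages[i][len(pages[i])-1]
	})
	if i == len(pages) {
		i = len(pages) - 1 // larger than everything: goes on the last page
	}
	p := pages[i]
	pos := sort.SearchInts(p, key)
	st.shifts += len(p) - pos // keys that must move to make room
	p = append(p[:pos], append([]int{key}, p[pos:]...)...)
	if len(p) > pageSize { // page overflow: split and link in a new page
		st.splits++
		mid := len(p) / 2
		left := append([]int(nil), p[:mid]...)
		right := append([]int(nil), p[mid:]...)
		pages[i] = left
		pages = append(pages[:i+1], append([][]int{right}, pages[i+1:]...)...)
	} else {
		pages[i] = p
	}
	return pages
}

func main() {
	var seq, rnd stats
	pages := [][]int{}
	for _, k := range []int{1, 2, 3, 4, 5, 6, 7, 8} { // auto-increment style
		pages = insert(pages, k, &seq)
	}
	pages = [][]int{}
	for _, k := range []int{5, 1, 8, 3, 7, 2, 6, 4} { // ID-number style
		pages = insert(pages, k, &rnd)
	}
	// sequential inserts only append and allocate; random ones also shift keys
	fmt.Printf("sequential: %+v  random: %+v\n", seq, rnd)
}
```

Both orders allocate new pages as they fill up, but only the out-of-order inserts force keys to move inside pages.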

In summary, using discontinuous ID numbers as primary keys can cause data movement within pages, random IO, and frequent allocation of new pages. With an auto-increment primary key the IDs are strictly sequential, inserts simply append to the rightmost page, there is no data movement and no random IO, and the insertion overhead stays small.

In fact, there is another reason not to use the ID number as the primary key: as a number it is too large for an int, so it has to be stored as a bigint. (For something like a school's student body, an auto-increment int would normally be plenty.) A page holds 16KB, and the more space each index entry occupies, the fewer entries fit on one page. For a given amount of data, bigint keys therefore require more pages, and thus more storage, than int keys.
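A rough sketch of why key width matters. The 6-byte child-pointer size is a textbook assumption, and real InnoDB records carry extra per-record overhead, so these counts are only illustrative:

```go
package main

import "fmt"

// keysPerPage is a rough upper bound on index entries per 16KB page,
// ignoring page headers and per-record overhead (real InnoDB stores
// noticeably fewer). The 6-byte child pointer is a textbook assumption.
func keysPerPage(keyBytes, pointerBytes int) int {
	const pageBytes = 16 * 1024
	return pageBytes / (keyBytes + pointerBytes)
}

func main() {
	fmt.Println("int keys per page:   ", keysPerPage(4, 6)) // 1638
	fmt.Println("bigint keys per page:", keysPerPage(8, 6)) // 1170
}
```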

3. The Spear and Shield of Joint Index

From the above we can draw two conclusions:

  • Avoid returning to the table where possible
  • The ID number is not suitable as the primary key

So the joint index naturally comes to mind: create a joint index of [ID number + name]. Pay attention to the column order, which must comply with the leftmost-prefix principle. Now when we execute the following SQL:

select name from user where id_card=xxx

we can get the name field we need without returning to the table. However, the problem that the ID number itself takes up too much space is still unsolved; that is a property of the business data itself. If we want to shrink it, we can use a conversion algorithm that maps the original large value to a smaller one, such as crc32:

crc32.ChecksumIEEE([]byte("341124199408203232"))

An ID number that originally needed 8 bytes of storage can thus be replaced by a 4-byte CRC code. We add a crc_id_card column to the table, and the joint index changes from [ID number + name] to [crc32(ID number) + name], so the index becomes smaller. But this conversion has its costs:

  • Every insert and query must compute a CRC, which costs extra CPU
  • Although the extra column shrinks the index, it takes up space of its own
  • CRC values can collide, so after the index lookup we still have to filter the candidate rows by the full id_card; the cost of filtering depends on the number of collisions, and the more duplicates there are, the slower it gets
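A sketch of the lookup against such an index, using Go's hash/crc32 as in the snippet above: match on the 4-byte CRC first, then filter candidates by the full ID number to weed out collisions. The user rows and names here are made-up sample data, and the linear scan stands in for a real index:

```go
package main

import (
	"fmt"
	"hash/crc32"
)

type user struct {
	idCard string
	name   string
}

// findByIDCard mimics a lookup through a [crc_id_card + name] index:
// match on the 4-byte CRC first, then filter candidates by the full ID
// number, since distinct ID numbers can share a CRC (a collision).
func findByIDCard(rows []user, idCard string) (string, bool) {
	want := crc32.ChecksumIEEE([]byte(idCard))
	for _, r := range rows {
		if crc32.ChecksumIEEE([]byte(r.idCard)) != want {
			continue // CRC mismatch: cheap 4-byte comparison rules it out
		}
		if r.idCard == idCard { // collision check against the full value
			return r.name, true
		}
	}
	return "", false
}

func main() {
	rows := []user{
		{"341124199408203232", "Alice"},
		{"110101199001011234", "Bob"},
	}
	fmt.Println(findByIDCard(rows, "341124199408203232")) // Alice true
}
```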

Regarding the storage optimization of joint indexes, here is a small detail. Suppose there are two fields A and B, occupying 8 and 20 bytes respectively. The joint index [A, B] already exists, but we also need to query on B alone, so we naturally create a separate index on B. The two indexes then occupy 8+20+20=48 bytes per entry, and queries through either A or B can use an index. But if the business allows, we could instead create [B, A] plus a single index on A: queries through A alone or B alone still use an index, and the space drops to 20+8+8=36 bytes.

4. Prefix index is short and powerful

Sometimes the field we need to index is a very long string. We want to index it, but we don't want the index to take up too much space. In this case we can consider a prefix index: build the index on only the first few characters of the field, enjoying the index while saving space. Note, however, that when the prefix has a high duplication rate, a prefix index will be slower than an ordinary full-column index, because many rows share the same prefix and each candidate must be verified against the full value.

alter table xx add index(name(7)); # index on the first 7 characters of name
select xx from xx where name="JamesBond"
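To pick a sensible prefix length, one usually measures how selective the first n characters are, e.g. with SELECT COUNT(DISTINCT LEFT(name, n)) / COUNT(*). A toy Go version of that measurement, with made-up ASCII sample names (so byte slicing is safe):

```go
package main

import "fmt"

// prefixSelectivity mirrors the usual check
//   SELECT COUNT(DISTINCT LEFT(name, n)) / COUNT(*) FROM t
// used to judge whether the first n characters are selective enough
// for a prefix index. Sample names are ASCII, so byte slicing is safe.
func prefixSelectivity(names []string, n int) float64 {
	seen := map[string]bool{}
	for _, s := range names {
		if len(s) > n {
			s = s[:n]
		}
		seen[s] = true
	}
	return float64(len(seen)) / float64(len(names))
}

func main() {
	names := []string{"JamesBond", "JamesDean", "JamesBrown", "JaneAusten"}
	fmt.Println(prefixSelectivity(names, 5)) // 0.5: 5-char prefixes repeat
	fmt.Println(prefixSelectivity(names, 7)) // 1:   7-char prefixes are unique
}
```

A selectivity close to 1 means the prefix distinguishes rows almost as well as the full column.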

5. The speed and slowness of unique index

Before talking about unique indexes, let's first understand the characteristics of ordinary indexes. We know that for B+ trees, the data of leaf nodes are ordered.

Suppose we query for the value 2. When the index tree lands on a 2, the storage engine does not stop searching, because there may be more than one 2: it continues scanning forward along the leaf node. After finding a second 2, does it stop? No. The engine cannot know whether further 2s follow, so it keeps scanning until it reaches the first value that is not 2 (here, 3), and only then stops. This is the retrieval process of a normal index.

A unique index is different. Thanks to its uniqueness there can be no duplicates, so once the target is found it is returned directly, without the extra forward scan a normal index performs. From this angle the unique index is faster than the normal index, although when all the matching entries sit on a single page the difference is negligible. For inserts, unique indexes are slightly worse: every insert must first check whether the value already exists, a check ordinary indexes skip. And one very important point: unique indexes cannot use the change buffer (see below).
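The difference in leaf-level work can be sketched by counting the records each kind of scan touches; this toy model treats the leaf as a sorted slice:

```go
package main

import (
	"fmt"
	"sort"
)

// normalScan models a normal index: after locating the first match it
// must keep reading until it sees a value that differs, because more
// duplicates may follow. Returns the number of leaf records touched.
func normalScan(leaf []int, key int) int {
	i := sort.SearchInts(leaf, key)
	reads := 0
	for i < len(leaf) {
		reads++
		if leaf[i] != key {
			break // first non-matching record ends the scan
		}
		i++
	}
	return reads
}

// uniqueScan models a unique index: uniqueness guarantees at most one
// match, so it can return as soon as the record is found.
func uniqueScan(leaf []int, key int) int {
	i := sort.SearchInts(leaf, key)
	if i < len(leaf) && leaf[i] == key {
		return 1
	}
	return 0
}

func main() {
	fmt.Println(normalScan([]int{1, 2, 2, 2, 3, 4}, 2)) // 4 records touched
	fmt.Println(uniqueScan([]int{1, 2, 3, 4}, 2))       // 1 record touched
}
```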

6. Don’t add indexes blindly

At work you may run into the question: does this field need an index? We usually judge by whether queries will use the field: if it often appears in query conditions, we consider indexing it. But that criterion alone can lead to a wrong index. An example: suppose a user table holds about 1 million rows, with a sex field indicating male or female, each accounting for roughly half. Now we want to retrieve the information of all males, so we add an index on the sex field and write the SQL like this:

select * from user where sex="男"

If nothing unexpected happens, InnoDB will not choose the sex index. If the sex index were used, every matching row would require a table return, and with this much data the consequences are easy to picture:

a large amount of IO. As estimated earlier, one row costs about 4 IOs; what about 500,000 rows? The result is predictable. In this case the MySQL optimizer is therefore likely to choose a full table scan, scanning the primary key index directly, because that will probably perform better.

7. Index failure

In some cases, MySQL cannot use an index because of our own improper usage, which most easily happens with type conversion. You might say: doesn't MySQL already support implicit conversion? For example, we have an integer user_id index field, and without paying attention we write:

select xx from user where user_id="1234"

Note that this is the string "1234". When this happens, MySQL is indeed smart enough to convert the string 1234 into the number 1234 and then happily use the user_id index. But if the user_id index field is a string type and, again without paying attention, we write:

select xx from user where user_id=1234

then there is a problem: the index will not be used. You may ask why MySQL doesn't convert this time; wouldn't turning the number 1234 into the string "1234" be enough? Here is the conversion rule to remember: when comparing a string with a number, MySQL converts the string into a number, never the reverse. So in this query every stored user_id must be cast to a number before comparison, and that defeats the index because of the B+ tree's structure: the tree is branched and sorted according to the original index values. Once a stored value A is replaced by its cast result B (int(A)=B), the tree's ordering no longer applies, because the tree was built over A, not B, so the index cannot be used.
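The ordering mismatch is easy to demonstrate: lexicographic order (how a string-typed index stores values) and numeric order disagree, e.g. "10" sorts before "9" as a string:

```go
package main

import (
	"fmt"
	"sort"
)

// lexicalOrder returns ids in the order a string-typed index stores
// them: lexicographic, not numeric.
func lexicalOrder(ids []string) []string {
	out := append([]string(nil), ids...)
	sort.Strings(out)
	return out
}

func main() {
	// A string index on user_id is ordered like this...
	fmt.Println(lexicalOrder([]string{"9", "10", "123", "1234"})) // [10 123 1234 9]
	// ...but WHERE user_id = 1234 compares numerically, where
	// 9 < 10 < 123 < 1234. The tree's lexicographic ordering cannot
	// guide a numeric search, so MySQL must scan and cast every value.
}
```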

Index optimization

1. Change buffer

We know that when updating a piece of data, we must first determine whether the page of this data is in the memory. If so, directly update the corresponding memory page. If not, we can only go to the disk to read the corresponding data page into the memory, and then update it. What problems will this cause?

  • The read operation to disk is a bit slow.
  • If a lot of data is updated at the same time, then many discrete IOs may occur.

The change buffer came into being to solve this speed problem. First, don't be misled by the word "buffer": besides living in the shared buffer pool, the change buffer is also persisted to disk. With a change buffer, if an update finds that the target data page is not in memory, InnoDB does not read the page from disk; it records the pending change in the change buffer instead. So when is the change buffer synchronized to disk, and what if a read arrives first? A background thread periodically merges the change buffer to disk, and if a read of the page occurs before the thread gets to it, the read itself triggers a merge of the buffered changes.

Note that not all indexes can use the change buffer: primary key indexes and unique indexes cannot. Because of uniqueness, they must check whether the value already exists on every update, and if the data page is not in memory it has to be read from disk anyway. Ordinary indexes need no such check. The larger the change buffer, the greater the theoretical benefit: discrete read IOs are reduced, and multiple changes to the same page only need to be merged to disk once. Of course, not every scenario suits the change buffer. If your business reads a row immediately after updating it, the change buffer backfires: merges are triggered constantly, the number of random IOs does not drop, and you pay the extra overhead of maintaining the change buffer on top.
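A grossly simplified sketch of the idea: queue updates to pages that are not in memory, and merge them when the page is finally read. Real InnoDB change buffering is far more involved; the page and record values here are made up:

```go
package main

import "fmt"

// A toy change buffer: updates to pages that are not in memory are
// queued instead of forcing a disk read; a later read of the page
// merges the queued changes first.
type store struct {
	disk   map[int][]string // pageID -> records (the "durable" copy)
	cached map[int]bool     // pages currently in the buffer pool
	buffer map[int][]string // pending inserts per page (the change buffer)
	reads  int              // disk page reads performed
}

func (s *store) insert(page int, rec string) {
	if s.cached[page] {
		s.disk[page] = append(s.disk[page], rec) // page in memory: apply directly
		return
	}
	s.buffer[page] = append(s.buffer[page], rec) // queue it: no disk read
}

func (s *store) read(page int) []string {
	if !s.cached[page] {
		s.reads++ // the one unavoidable disk read
		s.cached[page] = true
	}
	// merge pending changes before serving the read
	s.disk[page] = append(s.disk[page], s.buffer[page]...)
	delete(s.buffer, page)
	return s.disk[page]
}

func main() {
	s := &store{disk: map[int][]string{1: {"a"}}, cached: map[int]bool{}, buffer: map[int][]string{}}
	s.insert(1, "b")
	s.insert(1, "c") // two updates, still zero disk reads
	fmt.Println(s.read(1), s.reads) // [a b c] 1
}
```

Note how a read arriving right after the updates triggers the merge immediately, which is exactly the pattern where the change buffer stops paying off.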

2. Index pushdown

We have talked about joint indexes before. Joint indexes must meet the leftmost principle. That is, when the joint index is [A, B], we can use the index through the following SQL:

select * from table where A="xx"
select * from table where A="xx" AND B="xx"

In fact, a joint index can also match on a prefix of the leftmost column, for example:

select * from table where A like "赵%" AND B="沪"

Note what happens here. Because only part of A is used, before MySQL 5.6 the above SQL would return to the table (we select *) as soon as it retrieved each row where A starts with "Zhao", and only then compare B against "沪" (Shanghai). Isn't that a bit confusing? Why not check B directly on the joint index and save the table returns? The reason is the leftmost-prefix match: although the index can use part of A, it would not evaluate B at all, which looks a bit "stupid". Therefore MySQL 5.6 introduced index condition pushdown (ICP): even with a leftmost-prefix match, rows that fail the B condition are filtered out on the joint index itself, greatly reducing the number of table returns.
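The effect of ICP can be sketched by counting table returns with and without pushdown over a toy joint index. The entries and the "Zhao"/"Hu" values are made-up sample data standing in for the Chinese values in the query:

```go
package main

import (
	"fmt"
	"strings"
)

// an entry of the joint index [a, b], carrying the primary key id
type entry struct {
	a  string
	b  string
	id int
}

// tableLookups counts primary-key lookups (table returns) needed for
// WHERE a LIKE 'Zhao%' AND b = 'Hu' over the index entries. With
// pushdown, the b condition is checked on the index itself.
func tableLookups(idx []entry, pushdown bool) int {
	lookups := 0
	for _, e := range idx {
		if !strings.HasPrefix(e.a, "Zhao") {
			continue // leftmost-prefix match on a
		}
		if pushdown && e.b != "Hu" {
			continue // filtered on the index: no table return
		}
		lookups++ // fetch the full row by primary key
	}
	return lookups
}

func main() {
	idx := []entry{
		{"ZhaoYun", "Hu", 1},
		{"ZhaoLei", "Jing", 2},
		{"ZhaoMin", "Hu", 3},
		{"QianLi", "Hu", 4},
	}
	fmt.Println(tableLookups(idx, false), tableLookups(idx, true)) // 3 2
}
```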

3. Flushing neighboring pages

Before talking about flushing neighboring pages, let's talk about dirty pages. When updating a row, we first check whether the page holding it is in memory; if not, the page is read from disk into memory and then updated there. At that point the page in memory has the latest data while the page on disk is still stale: the in-memory page is a dirty page, and it must eventually be flushed to disk to restore consistency. So the question is: when to flush, and how many dirty pages at a time? Flushing after every change would make performance very poor; flushing too rarely lets dirty pages pile up, leaving fewer usable pages in the buffer pool and affecting normal operation. The flushing rate must therefore be neither too aggressive nor too tardy. MySQL has a cleanup thread that runs periodically, keeping the pace steady; and when there are too many dirty pages, or the redo log is almost full, a flush is triggered immediately to guarantee timeliness.

While flushing dirty pages, InnoDB applies an optimization: if the neighbors of the dirty page being flushed are also dirty, they are flushed together. The advantage is reduced random IO; on a mechanical disk the gain should be considerable. But there is a pitfall: if a neighboring dirty page is flushed along for the ride and then immediately dirtied again by new writes, the extra flush was wasted effort. Even worse, if the neighbor of the neighbor is also dirty... this chain reaction can cause short-term performance problems.

4. MRR (Multi-Range Read)

In actual business we are often told to use covering indexes as much as possible and to avoid returning to the table, because table returns cost extra IO, and worse, discrete IO. Sometimes, however, the table return is unavoidable:

select * from user where grade between 60 and 70

Now we need the users whose grades are between 60 and 70, with an index on the grade field, so we write the SQL above. The straightforward plan is: find grade=60 on the grade index, take its id, look the row up on the primary key index, go back to the grade index for the next entry, and repeat. Suppose the row for grade=60 (id=1) is on page_no_1 , the row for grade=61 (id=10) is on page_no_2 , and the row for grade=62 (id=2) is back on page_no_1 . The actual access pattern is then: read page_no_1, switch to page_no_2, and switch back to page_no_1. Yet the lookups for id=1 and id=2 could have been merged so that page_no_1 is read only once, which would not only save IO but also avoid the random access. That is exactly what MRR does: with MRR, the secondary index does not return to the table immediately; it collects the obtained primary key ids in a buffer, sorts them, and then reads the primary key index sequentially, greatly reducing discrete IO.
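The benefit of sorting the ids can be shown by counting page switches. This toy model puts two rows per primary-key page; the id list mirrors the example above:

```go
package main

import (
	"fmt"
	"sort"
)

const rowsPerPage = 2 // toy layout: id/rowsPerPage gives the page number

// pageSwitches counts how often fetching rows by the given ids, in
// order, has to move to a different primary-key page.
func pageSwitches(ids []int) int {
	switches, last := 0, -1
	for _, id := range ids {
		if page := id / rowsPerPage; page != last {
			switches++
			last = page
		}
	}
	return switches
}

func main() {
	ids := []int{1, 10, 2, 11, 3} // ids as they come off the secondary (grade) index
	fmt.Println("unsorted:", pageSwitches(ids)) // 5 page switches

	sorted := append([]int(nil), ids...)
	sort.Ints(sorted) // what MRR does in its buffer before the table reads
	fmt.Println("sorted:  ", pageSwitches(sorted)) // 3 page switches
}
```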

Finally

The above covers the pitfalls of MySQL database indexes and their rational use. For more information about MySQL indexes, please see the other related articles on 123WORDPRESS.COM!
