Detailed explanation of MySQL index selection and optimization

Index Model
B+Tree
Index selection
Index optimization

Index selectivity
Covering Index
Leftmost prefix principle + index pushdown
Prefix Index
Unique Index
Index failure

Summarize

Index Model

Hash Table

Applicable to scenarios with only equal value queries, the default index of the Memory engine
InnoDB supports adaptive hash indexes, which cannot be intervened. The engine decides whether to create them.

Ordered array: The performance in both equal value query and range query scenarios is excellent, but inserting and deleting data requires data movement, which is too costly. Therefore, it only applies to static storage engines.

Binary balanced tree: The left child of each node is smaller than the parent node, and the parent node is smaller than the right child. The time complexity is O(log(N))

Multi-branch balanced tree: The index is not only stored in memory, but also written to disk. In order to minimize disk reads for a query, the query process must access as few data blocks as possible. Therefore, an "N-ary" tree is used.

B+Tree

B-Tree and B+Tree

B-Tree

B+Tree

InnoDB uses the B+ tree index model. Suppose we have a table with a primary key column of ID, a field k in the table, and an index on k, as shown below:

Primary key index: also known as clustered index, the leaf node stores the entire row of data
Non-primary key index: also known as a secondary index, the leaf node content is the value of the primary key

Precautions

Indexes are based on orderly storage of data pages, so data pages may be split (insufficient page storage space) or merged (low page utilization due to data deletion).
The disordered insertion of data will cause data movement and even the splitting of data pages.
The smaller the primary key length, the smaller the leaf node of the common index, and the smaller the space occupied by the common index.
The smaller the index field, the more data can be stored in a single layer, which can reduce disk IO.

// Assume a data page of 16KB, a row of data of 1KB, an index pointer of 6 bytes, and an index field of bigint type (8 bytes)

// Number of indexes K = 16*1024/(8+6) = 1170

// Number of records in a single leaf node N = 16/1 = 16

// Layer 3 B+ record number V = K*K*N = 21902400

MyISAM also uses B+Tree indexes. The difference is that it does not distinguish between primary key and non-primary key indexes. Both are non-clustered indexes. The leaf nodes store pointers to data files.

Index selection

The purpose of the optimizer selecting an index is to find an optimal execution plan and execute the statement at the lowest cost. In the database, the number of scanned rows is one of the factors that affect the execution cost. Fewer rows scanned means fewer disk accesses and less CPU consumption.

Of course, the number of scanned rows is not the only criterion for judgment. The optimizer will also make a comprehensive judgment based on factors such as whether to use temporary tables and whether to sort.

How to calculate the number of scan lines

The more distinct values an index has, the more discriminative the index is. The number of distinct values in an index is called cardinality.

-- View the current index base mysql> show index from test;
+-------+------------+----------+--------------+--------------+--------------+-------------+------+--------+------+------------+---------+------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+-------+------------+----------+--------------+--------------+--------------+-------------+------+--------+------+------------+---------+------------+
| test | 0 | PRIMARY | 1 | id | A | 100256 | NULL | NULL | | BTREE | | |
| test | 1 | index_a | 1 | a | A | 98199 | NULL | NULL | YES | BTREE | | |
+-------+------------+----------+--------------+--------------+--------------+-------------+------+--------+------+------------+---------+------------+

From a performance perspective, InnoDB uses sampling statistics. By default, it selects N data pages, counts the different values on these pages, gets an average value, and then multiplies it by the number of pages of this index to get the cardinality of this index. Therefore, the two indexes above do not show the same cardinality.

The data table will be continuously updated, and the index statistics will not remain fixed. Therefore, when the number of changed data rows exceeds 1/M (the default is 10 when innodb_stats_persistent=on, and 16 otherwise), it will automatically trigger a re-index statistics.

mysql> show variables like '%innodb_stats_persistent%';
+--------------------------------------+-------------+
| Variable_name | Value |
+--------------------------------------+-------------+
-- Whether to automatically trigger the update of statistics. When the modified data exceeds 10%, the statistics will be recalculated | innodb_stats_auto_recalc | ON |
-- Controls whether delete-marked records are considered when recalculating statistics | innodb_stats_include_delete_marked | OFF |
-- Statistical method for null values. When the variable is set to nulls_equal, all NULL values are considered the same | innodb_stats_method | nulls_equal | 
-- Whether to trigger the update of statistics when operating metadata | innodb_stats_on_metadata | OFF |
--Whether statistics are stored persistently | innodb_stats_persistent | ON |
-- innodb_stats_persistent=on, number of sample pages for persistent statistics sampling | innodb_stats_persistent_sample_pages | 20 |
-- Deprecated, replaced by innodb_stats_transient_sample_pages | innodb_stats_sample_pages | 8 |
-- Transient sample page number | innodb_stats_transient_sample_pages | 8 |
+--------------------------------------+-------------+

In addition to inaccurate statistical cardinality due to sampling, MVCC can also lead to inaccurate cardinality statistics. For example, transaction A is started before transaction B and is not committed. Transaction B deletes some data. In repeatable read, transaction A can still query the deleted data. Currently, there are at least two versions of this data, one of which is marked as deleted.
The primary key is estimated directly based on the number of rows in the table. The optimizer directly uses the value of show table status like 't' for the number of rows in the table.
Manually trigger index statistics:

-- Recalculate index information mysql> analyze table t;

Effect of sorting on index selection

-- Create a table mysql> CREATE TABLE `t` (
`id` int(11) NOT NULL,
`a` int(11) DEFAULT NULL,
`b` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `a` (`a`),
KEY `b` (`b`)
)ENGINE=InnoDB;

-- Define the test data storage procedure mysql> delimiter;
CREATE PROCEDURE idata ()
BEGIN

DECLARE i INT ;
SET i = 1 ;
WHILE (i <= 100000) DO
 INSERT INTO t
VALUES
 (i, i, i) ;
SET i = i + 1 ;
END
WHILE ;
END;
delimiter ;

-- Execute the stored procedure and insert test data mysql> CALL idata ();

-- View the execution plan, using the index on field a mysql> explain select * from t where a between 10000 and 20000;
+----+-------------+-------+-------+---------------+-----+----------+------+-------+-----------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+-----+----------+------+-------+-----------------------+
| 1 | SIMPLE | t | range | a | a | 5 | NULL | 10000 | Using index condition |
+----+-------------+-------+-------+---------------+-----+----------+------+-------+-----------------------+

-- Since field b needs to be sorted, although index b needs to scan more rows, it is ordered. Considering the number of rows to be scanned and the sorting, the optimizer chooses index b, thinking that the cost is smaller.mysql> explain select * from t where (a between 1 and 1000) and (b between 50000 and 100000) order by b limit 1;
+----+-------------+-------+-------+---------------+-----+------+------+------+------+------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+-----+------+------+------+------+------------------------------------+
| 1 | SIMPLE | t | range | a,b | b | 5 | NULL | 50128 | Using index condition; Using where |
+----+-------------+-------+-------+---------------+-----+------+------+------+------+------------------------------------+

-- Solution 1: Use force index to force index a to correct the optimizer's incorrect choice. This is not recommended (not universal, and the statement needs to be changed if the index name is changed)
mysql> explain select * from t force index(a) where (a between 1 and 1000) and (b between 50000 and 100000) order by b limit 1;
+----+-------------+-------+-------+---------------+-----+------+------+------+----------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+-----+------+------+------+----------------------------------------------------+
| 1 | SIMPLE | t | range | a | a | 5 | NULL | 999 | Using index condition; Using where; Using filesort |
+----+-------------+-------+-------+---------------+-----+------+------+------+----------------------------------------------------+

-- Solution 2: Guide MySQL to use the desired index and sort by b, a. The optimizer needs to consider the cost of sorting a. mysql> explain select * from t where (a between 1 and 1000) and (b between 50000 and 100000) order by b, a limit 1;
+----+-------------+-------+-------+---------------+-----+------+------+------+----------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+-----+------+------+------+----------------------------------------------------+
| 1 | SIMPLE | t | range | a,b | a | 5 | NULL | 999 | Using index condition; Using where; Using filesort |
+----+-------------+-------+-------+---------------+-----+------+------+------+----------------------------------------------------+

-- Solution 3: In some scenarios, we can create a more appropriate index to provide to the optimizer for selection, or delete the misused index ALTER TABLE `t`
DROP INDEX `a`,
DROP INDEX `b`,
ADD INDEX `ab` (`a`,`b`);

Index optimization

Index selectivity

Index selectivity = cardinality / total number of rows

-- Selective index of field xxx in table t: select count(distinct xxx)/count(id) from t;

The selectivity of an index refers to the ratio of unique index values (cardinality) to the number of table records. Selectivity is an indicator of the index's screening ability. The index value range is 0~1. The greater the selectivity, the greater the index value.

When using a normal index query, the normal index is loaded first, the primary key of the actual row is queried through the normal index, and then the primary key is used to query the corresponding row through the clustered index, so as to query all rows in a loop. If you search the clustered index directly in its entirety, you do not need to switch back and forth between the normal index and the clustered index. Compared with the total cost of the two operations, scanning the entire table may be more efficient.

In actual work, it still depends on the business situation. If the data distribution is uneven, the actual query conditions always query the part with less data. Adding an index on the column with lower index selection may also have a good effect.

Covering Index

Covering indexes can reduce the number of tree searches and significantly improve query performance, so using covering indexes is a common performance optimization method.

-- You only need to check the value of ID, and the value of ID is already in the k index tree, so you can directly provide the query result without going back to the table select ID from T where k between 3 and 5

-- Add field V. V needs to be returned for each query. You can consider making k and v into a joint index select ID,V from T where k between 3 and 5

Leftmost prefix principle + index pushdown

-- Create a joint index on the three columns of id, name, and age. -- Satisfy the leftmost prefix principle, and both name and age are indexed. select * from T where name='xxx' and age=12

-- Mysql automatically optimizes and adjusts the order of name and age. Both name and age are indexed. select * from T where age=12 and name='xxx'

-- If name satisfies the leftmost prefix principle, the index is used. MySQL 5.6 introduces index condition pushdown optimization, that is, the records that do not satisfy age=12 are first filtered out in the index and then returned to the table. select * from T where name like 'xxx%' and age=12

-- Does not meet the leftmost prefix principle, no index is used select * from T where name like '%xxx%' and age=12

-- Satisfy the leftmost prefix principle, name uses the index select * from T where name='xxx'

-- Does not meet the leftmost prefix principle, does not use the index select * from T where age=12

Principles of establishing joint indexes:

If adjusting the order can reduce the maintenance of one index, then this order is often the one that needs to be given priority.
Space: Prioritize creating indexes for small fields separately, such as name and age. You can create a (name, age) joint index and a (age) single field index.

Prefix Index

mysql> create table SUser(
ID bigint unsigned primary key,
name varchar(64),  
email varchar(64),
...
)engine=innodb;

-- The following query scenariomysql> select name from SUser where email='xxx';

-- Solution 1: Full-text index. The number of table returns is determined by the amount of data that meets the conditions. mysql> alter table SUser add index index1(email);

-- Solution 2: prefix index. The number of table returns is determined by the prefix matching result. mysql> alter table SUser add index index2(email(6));

Prefix index can save space, but you need to pay attention to the definition of prefix length. While saving space, you should not increase the query cost too much, that is, reduce the number of table return verifications.

How to set the appropriate prefix length?

-- Preset an acceptable discrimination loss ratio and select the smallest prefix length that meets the conditions select count(distinct left(email,n))/count(distinct email) from SUser;

What if the appropriate prefix length is longer?

For example, if the ID card number meets the requirements for differentiation, a prefix index of more than 12 digits may be required. The space saved is limited and the query cost is increased. Therefore, there is no need to use a prefix index. At this point, we can consider using the following methods:

Reverse storage

-- String reversal query during query mysql> select field_list from t where id_card = reverse('input_id_card_string');

Using hash fields

-- Create an integer field to store the ID card verification code and create an index on this field mysql> alter table t add id_card_crc int unsigned, add index(id_card_crc);

-- Use the hash field to search by index, and then use the original field precision to filter mysql> select field_list from t where id_card_crc=crc32('input_id_card_string') and id_card='input_id_card_string'

Disadvantages of the above two methods:

Range queries are not supported
Using hash fields requires additional space, so a new field is added
Additional processing is required when reading and writing, such as reverse or crc32

What is the impact of prefix index on covering index?

-- If you use a prefix index, you won't need a covering index to optimize query performance. select id,email from SUser where email='xxx';

Unique Index

It is recommended to use a normal index. The unique index cannot use the change buffer and has a low memory hit rate.

Index failure

Do not perform column operations, including the use of functions, which may destroy the order of index values
Avoid %xxx queries that invalidate indexes
The or statement does not use indexes at the same time. If only one of the query fields on the left and right sides of the or statement is an index, the index becomes invalid.
Composite index ABC problem, leftmost prefix principle
Implicit type conversion
Implicit character encoding conversion
The optimizer abandons the index. Factors such as table return and sorting cost affect the optimizer. It switches to other indexes or scans the entire index.

Summarize

This is the end of this article about MySQL index selection and optimization. For more relevant MySQL index selection and optimization content, please search for previous articles on 123WORDPRESS.COM or continue to browse the following related articles. I hope everyone will support 123WORDPRESS.COM in the future!

You may also be interested in: