Why does MySQL database index choose to use B+ tree?

Why does MySQL database index choose to use B+ tree?

Before further analyzing why MySQL database index chooses to use B+ tree, I believe that many friends are still a little vague about the tree in the data structure, so we will explore the evolution process of the tree step by step from the shallower to the deeper, and then introduce the B tree step by step and why MySQL database index chooses to use B+ tree!

Those who have studied data structures generally have some understanding of the most basic tree, so we will start with the binary search tree which is closer to our topic.

1. Binary Search Tree

(1) Introduction to binary tree:

A binary search tree is also called an ordered binary search tree. It satisfies the general properties of a binary search tree, which means that an empty tree has the following properties:

1. If the left subtree of any node is not empty, the value of the left subtree is less than the value of the root node;

2. If the right subtree of any node is not empty, the value of the right subtree is greater than the value of the root node;

3. The left and right subtrees of any node are also binary search trees;

4. There are no nodes with equal key values;


The above picture is an ordinary binary search tree. According to the in-order traversal method, the output can be sorted from small to large: 2, 3, 5, 6, 7, 8.

To search the above binary tree, for example, to search for a record with a key value of 5, first find the root, whose key value is 6. 6 is greater than 5, so search the left subtree of 6 and find 3. Since 5 is greater than 3, search its right subtree. A total of 3 searches are performed. If you search for the same requirement three times in the order of 2, 3, 5, 6, 7, 8. Use the same method to search for the record with a key value of 8. This time it takes 3 searches, while a sequential search requires 6 searches. Calculating the average number of searches, we get: the average number of searches for sequential search is (1+2+3+4+5+6)/6 = 3.3 times, and the average number of searches for a binary search tree is (3+3+3+2+2+1)/6 = 2.3 times. The average search speed of a binary search tree is faster than that of a sequential search.

(2) Limitations and applications

A binary search tree is composed of n nodes randomly, so, for some cases, the binary search tree will degenerate into a linear chain with n nodes. As shown below:


Look at the picture above. If our root node is the smallest or largest number, then the binary search tree will completely degenerate into a linear structure. The average number of searches in the above figure is (1+2+3+4+5+5)/6=3.16 times, which is similar to sequential search. Obviously, the query efficiency of this binary tree is very low. Therefore, if you want to construct a binary search tree with maximum performance, this binary tree needs to be balanced (the balance here can be seen from a significant feature that the height of this tree is greater than the height of the previous input, which is unbalanced in the case of the same nodes), which leads to a new definition - balanced binary tree AVL.

2. AVL Tree

(1) Introduction

The AVL tree is a binary search tree with a balance condition. It generally uses the balance factor difference to determine whether it is balanced and achieves balance through rotation. The height of the left and right subtrees does not exceed 1. Compared with the red-black tree, it is a strictly balanced binary tree, and the balance condition must be met (the height difference of the left and right subtrees of all nodes does not exceed 1). Regardless of whether we are performing insertion or deletion operations, as long as the above conditions are not met, we must maintain balance through rotation, and rotation is very time-consuming. From this we can know that AVL tree is suitable for situations where the number of insertions and deletions is relatively small, but there are many searches.

From the above, we can see that this is an ordinary balanced binary tree. From this picture, we can see that the difference in balance factors between the left and right subtrees of any node will not be greater than 1.

(2) Limitations

Since the cost of maintaining this high degree of balance is greater than the efficiency gain obtained from it, there are not many practical applications. More places use red-black trees that pursue local rather than very strict overall balance. Of course, if the application scenario does not require frequent insertions and deletions, but has high requirements for searching, then AVL is still better than the red-black tree.

(3) Application

1. Widely present in the Windows NT kernel;

3. Red-Black Tree

(1) Introduction

A binary search tree, but with an additional bit at each node to represent the color of the node, which can be red or black. By restricting the way that nodes can be colored on any path from the root to a leaf, a red-black tree ensures that no path is twice as long as any other path. It is a weakly balanced binary tree (because it is balanced, it can be deduced that under the same node conditions, the height of the AVL tree is lower than that of the red-black tree). Compared with the strictly required AVL tree, its number of rotations is reduced, so when there are many search, insertion, and deletion operations, we use the red-black tree.

(2) Nature

1. Each node is either red or black;

2. The root node is black;

3. Each leaf node (leaf node is the NULL pointer or NULL node at the end of the tree) is black;

4. If a node is red, then both of its sons are black;

5. For any node, each path to the NULL pointer of the leaf tree contains the same number of black nodes;

6. Each path contains the same black nodes;

(3) Application

1. Widely used in C++ STL, Map and Set are both implemented using red-black trees;

2. The famous Linux process scheduler Completely Fair Scheduler uses a red-black tree to manage the process control block. The virtual memory area of ​​the process is stored in a red-black tree. Each virtual address area corresponds to a node of the red-black tree. The left pointer points to the adjacent address virtual storage area, and the right pointer points to the adjacent high address virtual address space.

3. The implementation of IO multiplexing epoll adopts red-black tree organization to manage sockfd to support fast addition, deletion, modification and query;

4. Nginx uses a red-black tree to manage timers, because the red-black tree is ordered and the timer with the smallest distance to the current one can be quickly obtained;

5. Implementation of TreeMap in Java;

4. B/B+ Tree

Having talked about the above three types of trees: binary search tree, AVL and red-black tree, it seems that we have not yet figured out why MySQL uses B+ tree as the implementation of index. Don't worry, let's first discuss what B-tree is.

(1) Introduction

Our data in MySQL is generally stored on disk. When reading data, there must be an operation to access the disk. There are two mechanical moving parts in the disk, namely the rotation of the disk and the movement of the magnetic arm. The rotation of the platter is the number of revolutions per minute mentioned in the market, while the disk movement is when the platter rotates to the specified position and moves the magnetic arm to start reading and writing data. Then there is a process of locating the block on the disk, and positioning is a time-consuming part of disk access, after all, the time spent on mechanical movement is much longer than the time spent on electronic movement. When large amounts of data are stored on disk, positioning is obviously a very time-consuming process, but we can optimize it through B-trees to improve the efficiency of positioning when reading disks.

Why can B-tree be optimized? Based on the characteristics of the B-tree, we can construct a multi-level B-tree, and then store relevant information on as many nodes as possible, ensuring that the number of layers is as small as possible, so that we can find information faster later, and the disk I/O operations are also less. Moreover, the B-tree is a balanced tree, and the height from each node to the leaf node is the same, which also ensures that each query is stable.

In general, the B/B+ tree is a balanced multi-way search tree designed for disks or other storage devices (compared to binary, each internal node of the B-tree has multiple branches). Compared with the red-black tree, under the same node situation, the height of a B/B+ tree is much smaller than the height of a red-black tree (mentioned in the performance analysis of the B/B+ tree below). The time for operations on a B/B+ tree usually consists of two parts: the time to access the disk and the CPU calculation time. The CPU is very fast, so the operating efficiency of the B-tree depends on the number of times the disk is accessed. When the total number of keywords is the same, the smaller the height of the B-tree, the less time is spent on disk I/O.

Note that a B-tree is a B-tree, and the - is just a symbol.

(2) Properties of B-tree

1. Define that any non-leaf node has at most M sons, and M>2;

2. The number of sons of the root node is [2, M];

3. The number of sons of non-leaf nodes other than the root node is [M/2, M];

4. Each node stores at least M/2-1 (rounded up) and at most M-1 keywords; (at least 2 keywords)

5. The number of keywords of non-leaf nodes = the number of pointers to sons - 1;

6. Keywords of non-leaf nodes: K[1], K[2], …, K[M-1]; and K[i] < K[i+1];

7. Pointers to non-leaf nodes: P[1], P[2], …, P[M]; P[1] points to the subtree whose key is less than K[1], P[M] points to the subtree whose key is greater than K[M-1], and the other P[i] points to the subtree whose key belongs to (K[i-1], K[i]);

8. All leaf nodes are located in the same layer;


This is just a simple B-tree. In practice, there are many keywords in the B-tree nodes. For example, in the figure above, node 35 represents a key (index), and the small black block represents the actual storage location in memory of the content pointed to by this key, which is a pointer.

5. B+ Tree

(1) Introduction

B+ tree is a variant of B tree generated by the file system (file directories are indexed level by level, and only the bottom leaf nodes (files) store data). Non-leaf nodes only store indexes, not actual data. Data is stored in leaf nodes. Isn't this how file system files are searched?

Let's take an example of file search: there are three folders a, b, and c, a contains b, b contains c, and a file yang.c. a, b, c are indexes (stored in non-leaf nodes), a, b, c are just the keys of yang.c to be found, and the actual data yang.c is stored on the leaf nodes.

All non-leaf nodes can be considered as index parts!

(2) Properties of B+ trees (the properties mentioned below are different from those of B trees)

1. The subtree pointer of non-leaf nodes is the same as the number of keywords;

2. The subtree pointer p[i] of the non-leaf node points to the subtree whose key value belongs to [k[i], k[i+1]]. (B-tree is an open interval, which means that B-tree does not allow key duplication, while B+ tree allows duplication);

3. Add a link pointer to all leaf nodes;

4. All keywords appear in leaf nodes (dense index). (And the keywords in the linked list happen to be in order);

5. Non-leaf nodes are equivalent to the index of leaf nodes (sparse index), and leaf nodes are equivalent to the data layer for storing (keyword) data;

6. More suitable for file systems;

A non-leaf node (such as 5, 28, 65) is just a key (index). The actual data stored in the leaf nodes (5, 8, 9) is the real data or a pointer to the real data.

(3) Application

1. B and B+ trees are mainly used for indexing in file systems and databases, such as MySQL;

6. B/B+ Tree Performance Analysis

The height of a balanced binary tree of n nodes is H (i.e. logn), while the height of a B/B+ tree of n nodes is logt((n+1)/2)+1;

As an in-memory lookup table, a B-tree is not necessarily better than a balanced binary tree, especially when m is large. Because the CPU time for search operations on B-trees is O(mlogtn)=O(lgn(m/lgt)), and m/lgt>1; so when m is large, O(mlogtn) is much longer than the operation time of a balanced binary tree. Therefore, when using B-tree in memory, a smaller m must be used. (Usually the minimum value m=3 is taken. At this time, each internal node in the B-tree can have 2 or 3 children. This third-order B-tree is called a 2-3 tree).

7. Why is B+ tree more suitable for database index than B tree?

1. The disk read and write cost of B+ tree is lower: the internal nodes of B+ tree do not have pointers to the specific information of keywords, so its internal nodes are smaller than those of B-tree. If all keywords of the same internal node are stored in the same disk block, the disk block can accommodate more keywords, and more keywords need to be searched at one time in the memory, so the relative IO read and write times are reduced.

2. The query efficiency of B+ tree is more stable: since the non-terminal point is not the node that ultimately points to the file content, but only the index of the keyword in the leaf node. Therefore, any keyword search must take a path from the root node to the leaf node. The path lengths of all keyword queries are the same, resulting in comparable query efficiency for each piece of data.

3. Since the data of the B+ tree is stored in the leaf nodes and the branch nodes are indexes, it is convenient to scan the database. You only need to scan the leaf nodes once. However, because the branch nodes of the B-tree also store data, we need to perform an in-order traversal to scan in order to find specific data. Therefore, the B+ tree is more suitable for interval queries, so B+ trees are usually used for database indexes.

PS: I saw someone say this on Zhihu, and I think it makes sense:

They believe that the main reason for using B+ trees in database indexes is that while B-trees improve IO performance, they do not solve the problem of inefficient element traversal. It is precisely to solve this problem that B+ trees were created. The B+ tree only needs to traverse the leaf nodes to traverse the entire tree. Moreover, range-based queries are very frequent in databases, but B-trees do not support such operations or are too inefficient.

Summarize

The above is the full content of this article. I hope that the content of this article will have certain reference learning value for your study or work. Thank you for your support of 123WORDPRESS.COM. If you want to learn more about this, please check out the following links

You may also be interested in:
  • What are the benefits of using B+ tree as index structure in MySQL?
  • What are the advantages of using B+ tree index in MySQL?
  • Analysis of the reasons why MySQL's index system uses B+ tree
  • The reason why MySQL uses B+ tree as its underlying data structure
  • Detailed explanation of the difference between B-tree index and B+ tree index in MySQL
  • Detailed explanation of MySQL B+ tree index and hash index
  • An article to understand why the MySQL index data structure uses B+ tree

<<:  Detailed explanation of the differences between js array find, some, filter, and reduce

>>:  Win10 uses Tsinghua source to quickly install pytorch-GPU version (recommended)

Recommend

MySQL master-slave configuration study notes

● I was planning to buy some cloud data to provid...

Vue implements an Input component that gets the key display shortcut key effect

I encountered a requirement to customize shortcut...

Linux uses bond to implement dual network cards to bind a single IP sample code

In order to provide high availability of the netw...

Summary of web design experience and skills

■ Website theme planning Be careful not to make yo...

Simple implementation of vue drag and drop

This article mainly introduces the simple impleme...

JavaScript to implement dynamic digital clock

This article shares the specific code for impleme...

Vue implements the shake function (compatible with ios13.3 and above)

Recently, I made a function similar to shake, usi...

Summary of several key points about mysql init_connect

The role of init_connect init_connect is usually ...

Learn MySQL index pushdown in five minutes

Table of contents Preface What is index pushdown?...

A comprehensive understanding of Vue.js functional components

Table of contents Preface React Functional Compon...

Example code for implementing a simple search engine with MySQL

Table of contents Preface Introduction ngram full...