What are the benefits of using B+ tree as index structure in MySQL?

What are the benefits of using B+ tree as index structure in MySQL?

Preface

In MySQL, both Innodb and MyIsam use B+ tree as the index structure (other indexes such as hash are not considered here). This article will start with the most common binary search tree, and gradually explain the problems solved by various trees and the new problems they face, thereby explaining why MySQL chooses the B+ tree as the index structure.

1. Binary Search Tree (BST): Unbalanced

A binary search tree (BST), also called a binary sorted tree, needs to satisfy the following conditions based on a binary tree: the values ​​of all nodes on the left subtree of any node are not greater than the value of the root node, and the values ​​of all nodes on the right subtree of any node are not less than the value of the root node. The following is a BST:

When fast lookup is required, storing data in a BST is a common choice, because the query time depends on the tree height and the average time complexity is O(lgn). However, the BST may grow crooked and become unbalanced, as shown in the following figure. At this time, the BST degenerates into a linked list and the time complexity degenerates to O(n).

To solve this problem, a balanced binary tree was introduced.

2. Balanced Binary Tree (AVL): Rotation Time-consuming

The AVL tree is a strictly balanced binary tree. The height difference between the left and right subtrees of all nodes cannot exceed 1. The search, insertion, and deletion of the AVL tree are O(lgn) in both the average and worst cases.

The key to achieving balance in AVL lies in the rotation operation: insertion and deletion may destroy the balance of the binary tree. In this case, one or more tree rotations are needed to rebalance the tree. When inserting data, only one rotation (single or double rotation) is required at most; however, when deleting data, the tree will be unbalanced. AVL needs to maintain the balance of all nodes on the path from the deleted node to the root node, and the order of rotation is O(lgn).

Due to the time-consuming rotation, the AVL tree is very inefficient when deleting data; when there are many deletion operations, the cost of maintaining balance may be higher than the benefits it brings, so AVL is not widely used in practice.

3. Red-black tree: the tree is too high

Compared with AVL trees, red-black trees do not pursue strict balance, but rough balance: they just ensure that the longest possible path from the root to a leaf is no longer than twice the shortest possible path. From the implementation point of view, the biggest feature of the red-black tree is that each node belongs to one of the two colors (red or black), and the division of node colors needs to meet specific rules (the specific rules are omitted). The following are examples of red-black trees:

Compared with the AVL tree, the query efficiency of the red-black tree will decrease because the tree is less balanced and has a higher height. However, the deletion efficiency of the red-black tree is greatly improved because the red-black tree also introduces color. When inserting or deleting data, only O(1) rotations and color changes are required to ensure basic balance, without the need for O(lgn) rotations like the AVL tree. In general, the statistical performance of red-black trees is higher than that of AVL.

Therefore, in practical applications, AVL trees are relatively rarely used, while red-black trees are very widely used. For example, TreeMap in Java uses red-black trees to store sorted key-value pairs; HashMap in Java8 uses linked lists + red-black trees to solve hash collision problems (when there are fewer conflicting nodes, linked lists are used, and when there are more conflicting nodes, red-black trees are used).

For situations where data is in memory (such as the above-mentioned TreeMap and HashMap), the performance of red-black trees is excellent. However, red-black trees are not good at handling data in auxiliary storage devices such as disks (such as databases such as MySQL) because they grow too high. When data is on disk, disk IO will become the biggest performance bottleneck, and the design goal should be to minimize the number of IO times. The higher the tree is, the more IO times are required for adding, deleting, modifying and checking, which will seriously affect performance.

4. B-tree: Born for disk

B-tree, also known as B-tree (with no minus sign in it), is a multi-way balanced search tree designed for auxiliary storage devices such as disks. Compared with a binary tree, each non-leaf node of the tree can have multiple subtrees. Therefore, when the total number of nodes is the same, the height of the B-tree is much smaller than that of the AVL tree and the red-black tree (the B-tree is a "short and fat man"), and the number of disk IO times is greatly reduced.

The most important concept in defining a B-tree is order. For an m-order B-tree, the following conditions must be met:

  • Each node contains at most m child nodes.
  • If the root node contains child nodes, it contains at least 2 child nodes; except the root node, each non-leaf node contains at least m/2 child nodes.
  • A non-leaf node with k children will contain k - 1 records.
  • All leaf nodes are in the same layer.

It can be seen that the definition of B-tree is mainly to limit the number of child nodes and records of non-leaf nodes.

The following figure is an example of a 3-order B-tree:

The advantage of B-tree is not only its small tree height, but also its utilization of the principle of locality of access. The so-called locality principle means that when a piece of data is used, the data nearby is more likely to be used in a short time. The B-tree stores data with similar keys in the same node. When a piece of data is accessed, the database reads the entire node into the cache. When the adjacent data is accessed next, it can be read directly from the cache without disk IO. In other words, the B-tree has a higher cache hit rate.

B-trees have some applications in databases. For example, MongoDB's index uses the B-tree structure. However, in many database applications, a variant of the B-tree, the B+ tree, is used.

5. B+ Tree

B+ tree is also a multi-way balanced search tree. The main differences between it and B tree are:

  • Each node in the B-tree (including leaf nodes and non-leaf nodes) stores real data, while only leaf nodes in the B+ tree store real data, and non-leaf nodes only store keys. In MySQL, the real data mentioned here may be all the data of the row (such as Innodb's clustered index), or just the primary key of the row (such as Innodb's auxiliary index), or the address of the row (such as MyIsam's non-clustered index).
  • A record in a B-tree will only appear once and will not appear repeatedly, while the key of a B+ tree may appear repeatedly - it will definitely appear in a leaf node and may also appear repeatedly in a non-leaf node.
  • The leaf nodes of the B+ tree are linked by a doubly linked list.
  • In a non-leaf node in a B-tree, the number of records is one less than the number of child nodes; while in a B+ tree, the number of records is the same as the number of child nodes.

Therefore, B+ tree has the following advantages over B tree:

  • **Fewer IO times: **Non-leaf nodes of the B+ tree contain only keys, not real data, so the number of records stored in each node is much larger than the B number (that is, the order m is larger), so the height of the B+ tree is lower and fewer IO times are required for access. In addition, since each node stores more records, the access locality principle is better utilized and the cache hit rate is higher.
  • **More suitable for range queries: **When performing a range query in a B-tree, first find the lower limit to be searched, and then perform an in-order traversal of the B-tree until the upper limit is found; while for a range query in a B+ tree, you only need to traverse the linked list.
  • **More stable query efficiency: **The query time complexity of the B-tree is between 1 and the tree height (corresponding to the records in the root node and leaf nodes respectively), while the query complexity of the B+ tree is stable at the tree height because all data is in the leaf nodes.

B+ trees also have disadvantages: since keys are repeated, they take up more space. However, compared with the performance advantages, the space disadvantage is often acceptable, so B+ trees are more widely used in databases than B trees.

6. Feel the power of B+ tree

As mentioned earlier, the biggest advantage of B-tree/B+ tree over binary trees such as red-black trees is that the tree height is smaller. In fact, for Innodb's B+ index, the tree height is generally 2-4 layers. Let's do some specific estimates.

The height of the tree is determined by the order. The larger the order, the shorter the tree. The order depends on how many records each node can store. Each node in Innodb uses a page with a page size of 16KB, of which metadata only occupies about 128 bytes (including file management header information, page header information, etc.), and most of the space is used to store data.

  • For non-leaf nodes, the record only contains the index key and a pointer to the next level node. Assuming that each non-leaf node page stores 1000 records, each record takes up approximately 16 bytes; this assumption is reasonable when the index is an integer or a short string. To extend this, we often hear suggestions that the index column length should not be too large. This is the reason: if the index column is too long, each node will contain too few records, which will cause the tree to be too tall, the index effect will be greatly reduced, and the index will waste more space.
  • For leaf nodes, the records contain the index key and value (the value may be the primary key of the row, a complete row of data, etc., see the previous article for details), and the amount of data is larger. Here it is assumed that each leaf node page stores 100 records (in fact, when the index is a clustered index, this number may be less than 100; when the index is a secondary index, this number may be much larger than 100; it can be estimated based on actual conditions).

For a 3-layer B+ tree, the first layer (root node) has 1 page, which can store 1000 records; the second layer has 1000 pages, which can store 1000 * 1000 records; the third layer (leaf node) has 1000 * 1000 pages, each page can store 100 records, so it can store 1000 * 1000 * 100 records, or 100 million records. For a binary tree, storing 100 million records requires about 26 layers.

VII. Conclusion

Finally, let’s summarize the problems solved by various trees and the new problems they face:

  1. Binary Search Tree (BST): It solves the basic problem of sorting, but because it cannot guarantee balance, it may degenerate into a linked list.
  2. Balanced binary tree (AVL): The balance problem is solved by rotation, but the efficiency of rotation operation is too low;
  3. Red-black tree: By abandoning strict balance and introducing red-black nodes, the problem of low AVL rotation efficiency is solved. However, in scenarios such as disks, the tree is still too high and the IO times are too high.
  4. B-tree: By changing the binary tree into a multi-way balanced search tree, the problem of the tree being too tall is solved;
  5. B+ tree: Based on the B-tree, non-leaf nodes are transformed into pure index nodes that do not store data, further reducing the height of the tree; in addition, leaf nodes are connected into a linked list using pointers, making range queries more efficient.

The above is the detailed content of the advantages of using B+ tree as the index structure of MySQL. For more information about MySQL B+ tree index structure, please pay attention to other related articles on 123WORDPRESS.COM!

You may also be interested in:
  • Why does MySQL database index choose to use B+ tree?
  • What are the advantages of using B+ tree index in MySQL?
  • Analysis of the reasons why MySQL's index system uses B+ tree
  • The reason why MySQL uses B+ tree as its underlying data structure
  • Detailed explanation of the difference between B-tree index and B+ tree index in MySQL
  • Detailed explanation of MySQL B+ tree index and hash index
  • An article to understand why the MySQL index data structure uses B+ tree

<<:  How to view image information in Docker

>>:  XHTML introductory tutorial: Use of list tags

Recommend

12 Useful Array Tricks in JavaScript

Table of contents Array deduplication 1. from() s...

Three ways to achieve text flashing effect in CSS3 Example code

1. Change the transparency to achieve the gradual...

Analysis and Solution of ERROR:2002 Reported When MySQL Starts

Preface This article mainly introduces the analys...

Solution to the ineffectiveness of flex layout width in css3

Two-column layout is often used in projects. Ther...

Detailed installation tutorial for MySQL zip archive version (5.7.19)

1. Download the zip archive version from the offi...

How to install Composer in Linux

1. Download the installation script - composer-se...

HTML table markup tutorial (28): cell border color attribute BORDERCOLOR

To beautify the table, you can set different bord...

Summary of using the reduce() method in JS

Table of contents 1. Grammar 2. Examples 3. Other...

MariaDB under Linux starts with the root user (recommended)

Recently, due to the need to test security produc...

Linux RabbitMQ cluster construction process diagram

1. Overall steps At the beginning, we introduced ...

Summary of events that browsers can register

Html event list General Events: onClick HTML: Mous...

CentOS 6 uses Docker to deploy Zookeeper operation example

This article describes how to use docker to deplo...