Table of contents
- Why do we need partitions?
- Partitioning strategy
- Partition Hazards
Why do we need partitions?
When faced with a very large table, at least one thing is certain: the table is too large to scan in full on every query. Ordinary indexes are often of little use at this scale, and their maintenance cost and space consumption are high. Relying on an index yields a large amount of fragmented, low-density data, so a single query can trigger thousands of random I/O operations and response times become unacceptable. In practice you can afford one or two indexes, and no more. That leaves two options: the query must scan sequentially over a specified portion of the table, or the desired portion of the data and its index must fit in the server's memory.
It bears repeating: at very large sizes, B-tree indexes do not work unless the index covers the entire query. Otherwise the server has to look up full rows in the table, performing random I/O across a wide span of storage, which results in unacceptable query response times. Maintaining the indexes (disk space, I/O operations) is also expensive.
This is the problem that partitioning can solve. The key point is that partitioning is a crude form of indexing with low overhead that gets us to the neighborhood of the data we want. From there, we can scan the nearby data sequentially or load it into memory and search it there. Partitioning has low overhead because it keeps no pointers to individual rows that would need updating. It does not locate data down to the row, and it involves no index-like data structure. In effect, partitioning simply classifies data into groups.
Partitioning strategy
For large tables, there are two partitioning strategies:
- Use no indexes: Create the table without any indexes and rely on partitions to locate the required rows. As long as the WHERE clause restricts the query to a small number of partitions, this is sufficient. You need to do the arithmetic to check whether the resulting query response time is acceptable. The assumption here is that the data is not held in memory and everything is read from disk. Cached data would quickly be evicted by other queries, so caching offers little benefit. This situation is common for tables with a very large number of rows. Note that the number of partitions should be limited to a few hundred.
- Use indexes and isolate hot data: If most of the data outside the hot set is rarely used, the hot data can be placed in its own partition so that this partition and its indexes can be loaded into memory. You can then use indexes to optimize performance, just as with an ordinary table.
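The arithmetic mentioned for the no-index strategy can be sketched as a back-of-envelope estimate. This is only a rough model under stated assumptions: the function name `scan_seconds` and all figures (row count, row size, disk throughput, partition count) are hypothetical, and it ignores caching, CPU cost, and storage engine overhead.

```python
def scan_seconds(rows_in_partition, avg_row_bytes, disk_mb_per_s):
    """Estimate the seconds needed to read one partition sequentially from disk.

    Assumes a pure sequential read with no cache hits (hypothetical model).
    """
    total_bytes = rows_in_partition * avg_row_bytes
    return total_bytes / (disk_mb_per_s * 1_000_000)


# Hypothetical figures: 1 billion rows spread evenly over 200 partitions,
# 100-byte rows, 200 MB/s sequential read throughput.
rows_per_partition = 1_000_000_000 // 200
print(scan_seconds(rows_per_partition, 100, 200))  # 2.5
```

If the estimate for a query that touches one or two partitions is already too slow, the partitions are too coarse for this strategy, or the hot-data strategy may be a better fit.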
Partition Hazards
Both partitioning strategies rest on two key assumptions: the search scope can be narrowed by pruning partitions at query time, and the partitions themselves are cheap. Neither assumption always holds. Here are some problems you may encounter:
- NULL values may defeat partition pruning: when the partition function can return NULL, partitioning behaves in a surprising way, because MySQL treats the first partition as special. Suppose the table uses PARTITION BY RANGE (YEAR(order_date)). Any row whose order_date is NULL or an invalid date is stored in the first partition. Now suppose you write a query with the condition WHERE order_date BETWEEN '2021-01-01' AND '2021-01-31'. MySQL actually checks two partitions: the one that matches the range, and the first partition, because YEAR() returns NULL for invalid input and rows matching the condition might therefore be stored there. The same applies to other functions, such as TO_DAYS. This hurts if the first partition is large, especially with the first strategy, which has no indexes. Scanning two partitions instead of one is clearly not what you want. To avoid this, create a "fake" first partition, for example PARTITION p_nulls VALUES LESS THAN (0). If the table contains no invalid data, this partition stays empty. It will still be checked, but because it is empty or nearly empty, the impact on performance is small. In MySQL 5.5 and later, this workaround is unnecessary when partitioning directly on a column, but it is still required when partitioning on a function.
- Index does not match the partitioning: If an index is defined on a column that does not match the partitioning condition, a query may be unable to prune partitions. Suppose an index is defined on column a but the table is partitioned on column b. Since each partition has its own index, a lookup on this index has to traverse the index tree in every partition. If all non-leaf nodes of the index trees are in memory, the query will be faster, but there is no way to avoid checking the index in every partition. To avoid this, try not to index columns other than the partitioning column unless the WHERE condition itself can identify the partition. This sounds easy to avoid, but it can catch you by surprise. For example, suppose a partitioned table is the second table in a join, and the index used for the join is not on the partitioning column. Then each row in the join will access and scan every partition of the second table.
- Deciding which partition to use can be expensive: different partition types are implemented in different ways, so actual performance is not uniform. Questions such as "which partition does this row belong to?" or "which partitions can contain rows that match this query?" are hard to answer when there are many partitions. A linear search over the partition definitions is not always efficient, and it becomes more expensive as the number of partitions grows. The worst case is row-by-row inserts: each time a row is inserted into a partitioned table, the server has to search the partition list to find where the new row belongs. This problem can be mitigated by limiting the number of partitions; in general, more than about 100 partitions is not recommended. Other partition types, such as key and hash partitions, do not have this limitation.
- Opening and locking partitions can also be expensive: One side effect of partitioned tables is that a query has to open and lock every partition. This happens before partition pruning. The cost is independent of the partition type and affects all statements. It is especially noticeable for queries that touch very little data, such as a single-row lookup. The cost can be reduced by working in batches instead of one row at a time, for example by inserting multiple rows per statement, using LOAD DATA INFILE, or deleting a range of rows at once. Limiting the number of partitions also helps.
- Maintenance operations can be expensive: Some partition maintenance is quick, such as creating or dropping partitions. Other operations, such as reorganizing partitions, work much like ALTER on a table: they loop over the rows and copy them. For example, resizing a partition creates a temporary partition, moves the data into the new partition, and then deletes the old one.
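The NULL hazard from the list above can be illustrated with a toy model. This is a simplified sketch, not MySQL's actual pruning code: the partition layout, the names `PARTITIONS`, `partition_for`, and `partitions_to_scan`, and the year boundaries are all hypothetical, chosen to mirror a table defined with PARTITION BY RANGE (YEAR(order_date)) and a leading p_nulls partition.

```python
from datetime import date

# Hypothetical layout: (partition name, exclusive upper bound on YEAR(order_date)).
# Mirrors PARTITION p_nulls VALUES LESS THAN (0), p2020 VALUES LESS THAN (2021), ...
PARTITIONS = [("p_nulls", 0), ("p2020", 2021), ("p2021", 2022), ("p2022", 2023)]


def partition_for(order_date):
    """Return the partition that stores a row with the given order_date.

    A NULL (None) value is treated as lower than every bound, so the row
    lands in the first partition.
    """
    if order_date is None:
        return PARTITIONS[0][0]
    for name, upper in PARTITIONS:
        if order_date.year < upper:
            return name
    raise ValueError("no partition accepts year %d" % order_date.year)


def partitions_to_scan(lo, hi):
    """Toy pruning for WHERE order_date BETWEEN lo AND hi."""
    # The first partition is always kept, because the partition function
    # may yield NULL and such rows are stored there.
    scan = [PARTITIONS[0][0]]
    prev_upper = PARTITIONS[0][1]
    for name, upper in PARTITIONS[1:]:
        # This partition holds years in the half-open range [prev_upper, upper).
        if lo.year < upper and hi.year >= prev_upper:
            scan.append(name)
        prev_upper = upper
    return scan


print(partition_for(None))
print(partitions_to_scan(date(2021, 1, 1), date(2021, 1, 31)))
```

In this model the January 2021 query touches p_nulls and p2021. Because p_nulls is empty unless invalid dates exist, the extra partition costs almost nothing; without it, the extra partition would be p2020 with a full year of data.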
As mentioned above, partitioning is not a perfect solution. The current version of MySQL has some other constraints:
- All partitions must use the same storage engine.
- There are certain restrictions on the functions or expressions that can be used as partition functions.
- Some storage engines do not support partitioning.
- For MyISAM tables, LOAD INDEX INTO CACHE cannot be used.
- For MyISAM tables, partitioned tables require more open file descriptors, because a single table cache entry may correspond to multiple file descriptors. As a result, even if the table cache is configured to keep the server under the operating system's per-process file descriptor limit, partitioned tables may still exceed that limit.
Of course, with each new MySQL release, support for partitioning keeps improving, and many partitioning problems have been fixed.