How to use partitioning to optimize MySQL processing of billions of rows of data

When a MySQL table holds tens of millions of rows, most query performance problems can be solved with indexes. But once the table grows to hundreds of millions or billions of rows, indexes alone are no longer enough.

The data table (log) looks like this:

  • Table size: 1 TB, about 2.4 billion rows;
  • Table partitioning: partitioned by time, one partition per month; each partition holds about 200-300 million rows (roughly 40-70 GB). A sketch of this layout follows.
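
The monthly layout on the source table presumably resembles the following sketch; the column name log_time_ is taken from the day-partitioned DDL shown later in this article, while the partition names here are only illustrative:

PARTITION BY RANGE (to_days(log_time_)) (
    PARTITION p201812 VALUES LESS THAN (to_days('2019-01-01')),
    PARTITION p201901 VALUES LESS THAN (to_days('2019-02-01')),
    PARTITION p201902 VALUES LESS THAN (to_days('2019-03-01')),
    ...
);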

Since the data did not need to be processed in full, after discussing with the stakeholders we sampled a portion of it by time period; for example, one month of data comes to about 350 million rows.
The data processing approach:

1) Use InnoDB as the engine for the working table. Since the data is partitioned by month, we copy each monthly partition out separately. The source table uses the MyISAM engine; because we may need to filter some rows and the filter columns are not indexed, and adding indexes to a MyISAM table of this size is slow, we copy the partition into an InnoDB table first (see the sketch after the DDL below).
2) Partition the new table by day. After adding indexes to the copied table (about 2-4 hours), filter out the useless rows, extract the required fields from the JSON, and load the result into a new table partitioned by day:

CREATE TABLE `tb_name` (
  `id_` bigint NOT NULL,  -- column types were omitted in the original
  ...,
  KEY `idx_1` (`create_user_`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='Application log'
PARTITION BY RANGE (to_days(log_time_)) (
    -- to_days('2019-01-01') = 737425, so p1231 holds rows from 2018-12-31
    PARTITION p1231 VALUES LESS THAN (737425),
    PARTITION p0101 VALUES LESS THAN (737426),
    PARTITION p0102 VALUES LESS THAN (737427),
    PARTITION p0103 VALUES LESS THAN (737428),
    PARTITION p0104 VALUES LESS THAN (737429),
    ...  -- one partition per day
);
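
For steps 1 and 2, a minimal sketch, assuming a hypothetical MyISAM source table src_log with a monthly partition p201901 (explicit partition selection in queries requires MySQL 5.6 or later):

-- step 1: copy one monthly partition of the MyISAM source into an InnoDB working table
CREATE TABLE log_copy ENGINE=InnoDB DEFAULT CHARSET=utf8
AS SELECT * FROM src_log PARTITION (p201901);

-- step 2: index the filter column on the copy (about 2-4 hours at this size),
-- then filter and load the extracted fields into the day-partitioned table above
ALTER TABLE log_copy ADD KEY idx_1 (create_user_);

INSERT INTO tb_name
SELECT ...   -- required fields, extracted from the JSON
FROM log_copy
WHERE ...;   -- keep only the useful rows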

3) Perform the daily aggregation or other operations on the table generated above and store the results in a temporary table. Try to use stored procedures for this processing. Since the processing is relatively complex and time-consuming (one run of a stored procedure takes about 1-2 hours), record the operation time and parameters of each step when the procedure is called in a loop:

delimiter $$
create procedure proc_name(param varchar(50))
begin
    declare start_date date;
    declare end_date date;
    set start_date = '2018-12-31';
    set end_date = '2019-02-01';

    truncate tmp_table;  -- TRUNCATE is DDL and commits implicitly

    while start_date < end_date do
        set @partition_name = date_format(start_date, '%m%d');
        set @start_time = now();  -- record the start time for this partition

        start transaction;
        set @sqlstr = concat(
            "insert into tmp_table ",
            "select field_names ",
            "from tb_name partition(p", @partition_name, ") t ",
            "where conditions"  -- placeholder for the real filter conditions
        );
        -- select @sqlstr;  -- uncomment to inspect the generated SQL
        prepare stmt from @sqlstr;
        execute stmt;
        deallocate prepare stmt;
        commit;

        set @finish_time = now();  -- record the end time
        -- log the parameters, partition name, and timing for this iteration
        insert into operate_log values(param, @partition_name, @start_time, @finish_time,
            timestampdiff(second, @start_time, @finish_time));

        set start_date = date_add(start_date, interval 1 day);
    end while;
end
$$
delimiter ;
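
A call might look like the following; the param argument is only written to the log table, so any label identifying the run works:

call proc_name('201901_sampling');

-- review the per-partition timings recorded by the procedure
select * from operate_log;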

4) Sort and further process the results generated above, as illustrated below.
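
For illustration only (the column name log_date_ and the aggregate are assumptions, not from the original), the final pass over tmp_table might be a simple ordered aggregation:

select log_date_, count(*) as cnt_
from tmp_table
group by log_date_
order by cnt_ desc;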

In general, the processing is cumbersome and generates many intermediate tables, and for the key steps the metadata of the operation process also needs to be recorded, which places high demands on the SQL. Therefore, it is not recommended to use MySQL for this kind of task unless absolutely necessary. If the processing can be moved to a big data platform, it will be faster and the metadata management will be more professional.

This concludes this article on how to use partitioning to optimize MySQL processing of billions of rows of data. For more content on MySQL optimization at this scale, please search 123WORDPRESS.COM's previous articles. I hope everyone will continue to support 123WORDPRESS.COM!
