Sharing ideas on processing tens of millions of rows in a single MySQL table

Project Background

The table in question holds tens of millions of rows. Field A has to be updated in the morning; in the afternoon, once the crawler team finishes crawling the specifications and pictures, the picture and specification fields have to be updated as well. Paging through a table this size with a deep offset gets slower and slower:

select a,b,c from db.tb limit 10000 offset 9000000

MySQL has to scan past the first 9,000,000 rows before it can return the next 10,000, so each page is slower than the last. Time is limited, though. Is there a better way to solve this problem?

Improvement ideas

Is there a way to update the data without deep paging?
Yes: use the auto-increment id column.

Observe data characteristics

This table has an auto-increment id column, which is also the primary key. The ideal way to query and update data is through this indexed column:

select a,b,c from db.tb where id=9999999;
update db.tb set a=x where id=9999999;
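
Batched reads can stay on the primary key too: instead of an offset, each query picks up from the last id it saw. A minimal sketch of such a range scan, where the step of 1000 ids per batch is an illustrative assumption:

select a,b,c from db.tb where id > 0 and id <= 1000;
select a,b,c from db.tb where id > 1000 and id <= 2000;
-- ...continue until the maximum id is reached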

Multi-processing

Each process handles a fixed id range, which avoids deep paging and lets multiple processes work on the data simultaneously.
Faster queries mean faster overall processing.
Here is the task allocation function I wrote for reference:

def mission_handler(all_missions, worker_mission_size):
    """
    Build the task list from the total number of tasks and the number of
    tasks per worker. Each element is (task start id, task end id).
    Example: with 100 tasks in total and 40 tasks per worker, the task
    list is [(1, 40), (41, 80), (81, 100)].
    :param all_missions: total number of missions
    :param worker_mission_size: maximum number of missions per worker
    :return: [(start_id, end_id), (start_id, end_id), ...]
    """
    worker_mission_ids = []
    current_id = 0
    while current_id <= all_missions:
        start_id = all_missions if current_id + 1 >= all_missions else current_id + 1
        end_id = all_missions if current_id + worker_mission_size >= all_missions else current_id + worker_mission_size
        if start_id == end_id:
            # The previous range already ends at this id, so we are done.
            if worker_mission_ids and worker_mission_ids[-1][1] == start_id:
                break
        worker_mission_ids.append((start_id, end_id))
        current_id += worker_mission_size

    return worker_mission_ids

Assume the maximum id in the table is 100 and we want each process to handle 40 ids; the task list will then be:

>>> mission_handler(100, 40)
[(1, 40), (41, 80), (81, 100)]

So,
process 1 only needs to handle rows with ids 1 to 40;
process 2 only needs to handle rows with ids 41 to 80;
process 3 only needs to handle rows with ids 81 to 100.

from concurrent.futures import ProcessPoolExecutor


def main():
    # Maximum value of the auto-increment id
    max_id = 30000000
    # Volume of data handled by a single worker
    worker_mission_size = 1000000
    # Split the work into missions and process them with multiple processes
    missions = mission_handler(max_id, worker_mission_size)
    workers = []
    executor = ProcessPoolExecutor()
    for idx, mission in enumerate(missions):
        start_id, end_id = mission
        workers.append(executor.submit(data_handler, start_id, end_id, idx))


def data_handler(start_id, end_id, worker_id):
    pass
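
The skeleton above submits the missions but never collects their results. A minimal sketch of draining the futures at the end of main(), assuming the 'worker_id(...): result(...)' return convention used by data_handler below (printing stands in for real logging):

from concurrent.futures import as_completed

# At the end of main(): wait for every worker and surface its result
# string so that failed id ranges can be re-run later.
for future in as_completed(workers):
    print(future.result())
executor.shutdown()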

Summary of ideas

  1. Avoid deep paging: query and update data through the auto-increment id
  2. Process the data with multiple processes

Data processing tips

Record the ids of rows that were processed successfully and of rows that failed, for later follow-up:

-- Use another table to record the processing status
insert into db.tb_handle_status(row_id, success) values (999, 0);
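
One possible shape for that status table (this schema is an assumption, not from the original; adjust types and indexes to your data):

create table db.tb_handle_status (
    id bigint unsigned auto_increment primary key,
    row_id bigint unsigned not null comment 'id of the processed row',
    success tinyint(1) not null default 0 comment '1 = success, 0 = failure',
    key idx_row_id (row_id)
);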

Catch exceptions inside the loop so that one bad row does not abort the whole program:

def data_handler(start_id, end_id, worker_id):
    # Database connection
    conn, cursor = mysql()
    current_id = start_id
    try:
        while current_id <= end_id:
            try:
                # TODO: data processing code
                pass
            except Exception as e:
                # TODO: record the failure, then move on to the next id
                current_id += 1
                continue
            else:
                # No exception raised; move on to the next id
                current_id += 1
    except Exception as e:
        return 'worker_id({}): result({})'.format(worker_id, False)
    finally:
        # Release database resources
        cursor.close()
        conn.close()

    return 'worker_id({}): result({})'.format(worker_id, True)
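
The handler assumes a mysql() helper that returns an open connection and cursor. A minimal sketch using pymysql, with placeholder connection parameters:

import pymysql


def mysql():
    # Placeholder credentials; replace with your own configuration.
    conn = pymysql.connect(host='127.0.0.1', user='root', password='***',
                           database='db', charset='utf8mb4')
    return conn, conn.cursor()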

Use batched submission to update the database wherever possible:

sql = """update db.tb set a=%s, b=%s where id=%s"""
values = [
    ('a_value', 'b_value', 9999),
    ('a_value', 'b_value', 9998),
    ...
]
# Batch submission reduces network IO and the frequency of lock acquisition
cursor.executemany(sql, values)
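
When the value list is very large, committing in fixed-size chunks keeps each transaction short. A minimal sketch, where the 1000-row chunk size is an assumption to tune for your workload:

CHUNK_SIZE = 1000
for i in range(0, len(values), CHUNK_SIZE):
    cursor.executemany(sql, values[i:i + CHUNK_SIZE])
    # Commit each chunk so row locks are held only briefly.
    conn.commit()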

The above is the detailed content of the ideas for processing tens of millions of rows in a single MySQL table.
