Project Background

In the morning, field A needs to be updated. In the afternoon, once the crawler team finishes crawling the specifications and pictures, the picture and specification fields need to be updated as well. Because the single table holds tens of millions of rows, deep pagination makes processing slower and slower:

select a, b, c from db.tb limit 10000 offset 9000000;

Time is limited, though. Is there a better way to solve this problem?

Improvement ideas

Is there a way to update the data without deep pagination?

Observe the data characteristics

The table has an auto-increment id column, which is also the primary key. The ideal way to query and update data is through this index column:

select a, b, c from db.tb where id = 9999999;
update db.tb set a = x where id = 9999999;

Multi-processing

Each process handles the data within a certain id range. This avoids deep pagination and lets several processes work on the data at the same time.

def mission_handler(all_missions, worker_mission_size):
    """
    Calculate the task list from the total number of tasks and the number of
    tasks per worker. Each element of the task list is
    (task start id, task end id).
    Example: if the total number of tasks is 100 and each worker handles 40
    tasks, the task list is: [(1, 40), (41, 80), (81, 100)]
    :param all_missions: total number of tasks
    :param worker_mission_size: maximum number of tasks per worker
    :return: [(start_id, end_id), (start_id, end_id), ...]
    """
    worker_mission_ids = []
    current_id = 0
    while current_id <= all_missions:
        start_id = all_missions if current_id + 1 >= all_missions else current_id + 1
        end_id = all_missions if current_id + worker_mission_size >= all_missions else current_id + worker_mission_size
        if start_id == end_id:
            # The tail of the range has already been covered; stop.
            if worker_mission_ids[-1][1] == start_id:
                break
        worker_mission_ids.append((start_id, end_id))
        current_id += worker_mission_size
    return worker_mission_ids

Assume the maximum id in the table is 100 and we want each worker to process 40 ids; the task list will then be:

>>> mission_handler(100, 40)
[(1, 40), (41, 80), (81, 100)]

So:

from concurrent.futures import ProcessPoolExecutor

def main():
    # Maximum value of the auto-increment id
    max_id = 30000000
    # Data volume handled by a single worker
    worker_mission_size = 1000000
    # Process the tasks with multiple processes
    missions = mission_handler(max_id, worker_mission_size)
    workers = []
    executor = ProcessPoolExecutor()
    for idx, mission in enumerate(missions):
        start_id, end_id = mission
        workers.append(executor.submit(data_handler, start_id, end_id, idx))

def data_handler(start_id, end_id, worker_id):
    pass

Summary of ideas

Address rows through the indexed auto-increment id instead of limit/offset, and split the full id range into segments so that each process handles its own segment in parallel.
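data_handler is left as a stub above. As a minimal sketch of what one worker might do with its id range, the following fetches rows in small primary-key range scans (pymysql, the connection parameters, and the chunk size are assumptions for illustration, not part of the original article):

import pymysql

def fetch_rows(start_id, end_id, chunk_size=1000):
    """Yield rows for one worker's id range using primary-key range scans."""
    # Placeholder connection parameters; adjust for the real environment.
    conn = pymysql.connect(host='127.0.0.1', user='root', password='', database='db')
    cursor = conn.cursor()
    try:
        current = start_id
        while current <= end_id:
            upper = min(current + chunk_size - 1, end_id)
            # Range scan on the primary key: each batch costs roughly the same,
            # unlike limit/offset, which rescans from the beginning every time.
            cursor.execute(
                "select a, b, c from db.tb where id between %s and %s",
                (current, upper),
            )
            for row in cursor.fetchall():
                yield row
            current = upper + 1
    finally:
        cursor.close()
        conn.close()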
Data processing skills

Record the ids of rows that were processed successfully and of those that failed, for later follow-up:

# Use a separate table to record the processing status
insert into db.tb_handle_status(row_id, success) values (999, 0);

Catch exceptions inside the loop so the program does not exit abnormally:

def data_handler(start_id, end_id, worker_id):
    # Database connection (mysql() is a helper that returns a connection and a cursor)
    conn, cursor = mysql()
    current_id = start_id
    try:
        while current_id <= end_id:
            try:
                # TODO data processing code
                pass
            except Exception as e:
                # TODO record the processing result
                # Move on to the next row
                current_id += 1
                continue
            else:
                # No exception; continue with the next row
                current_id += 1
    except Exception as e:
        return 'worker_id({}): result({})'.format(worker_id, False)
    finally:
        # Release the database resources
        cursor.close()
        conn.close()
    return 'worker_id({}): result({})'.format(worker_id, True)

Update the data with batch submission as much as possible:

sql = """update db.tb set a=%s, b=%s where id=%s"""
values = [
    ('a_value', 'b_value', 9999),
    ('a_value', 'b_value', 9998),
    ...
]
# Batch submission reduces network io and the frequency of lock acquisition
cursor.executemany(sql, values)
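Building on the executemany snippet, a minimal sketch of committing updates in fixed-size batches (pymysql, the connection parameters, and the batch size are assumptions for illustration):

import pymysql

def batch_update(rows, batch_size=1000):
    """Apply updates in batches; each element of rows is (a_value, b_value, id)."""
    # Placeholder connection parameters; adjust for the real environment.
    conn = pymysql.connect(host='127.0.0.1', user='root', password='', database='db')
    cursor = conn.cursor()
    sql = "update db.tb set a=%s, b=%s where id=%s"
    try:
        for i in range(0, len(rows), batch_size):
            # One round trip and one commit per batch, which reduces network io
            # and how often locks have to be acquired and released.
            cursor.executemany(sql, rows[i:i + batch_size])
            conn.commit()
    finally:
        cursor.close()
        conn.close()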