An example of quickly inserting tens of millions of rows into MySQL

In data analysis, the database is our good helper: not only can it answer queries in acceptable time, we can also run further analysis on top of it. That means the data must first be inserted into the database. In practice we often face data volumes of tens of millions of rows or even more, and without a fast insertion method the import alone costs an enormous amount of time.

I ran into exactly this problem when I took part in Alibaba's Tianchi big data algorithm competition (popular music trend prediction). Before optimizing the database queries and inserts I wasted a great deal of time: inserting just 15 million rows with the most basic one-by-one method took an incredible 12 hours. That pushed me to think about how to optimize database insertion and query operations to improve efficiency.

Continuous optimization improved performance dramatically. Querying and aggregating the download, play, and favorite counts of more than 26,000 songs from the database as time series dropped from an estimated 40 hours to just over one hour. Insertion improved just as much: in a test on a new data set, more than 54.9 million rows were inserted in 20 minutes. I share my approach below.

The optimization proceeds in two steps. The first step reads the data from the CSV file with a buffered reader and, once a certain number of lines has accumulated, starts a thread that inserts them into the database. The second step uses MySQL's batch insert operations.

Step 1: read the file and start the insertion threads

Here, "a certain amount" needs some thought. In my experiments I started with 1,000,000 (100w) lines per batch, but that caused a Java heap overflow, so I finally settled on 100,000 (10w).

Of course other batch sizes work too; choose whatever suits your setup.
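How large a batch can be depends on the heap: each pending batch lives in memory as a list of Strings, and several batches can be alive at once while worker threads drain them. Below is a minimal sketch of making the batch size tunable from the command line; the class, property name, and default here are my own invention, not part of the original program.

public class BatchSizeConfig {
    // hypothetical helper, not from the original program: read the batch size
    // from a system property so it can be tuned without recompiling,
    // e.g. java -Dbatch.size=50000 BatchSizeConfig
    public static final int BATCH_SIZE = Integer.getInteger("batch.size", 100000);

    public static void main(String[] args) {
        // a CSV line of ~100 characters costs a few hundred bytes as a String,
        // so a 1,000,000-line batch approaches hundreds of MB on the heap,
        // which is why 100w overflowed while 10w fits comfortably
        System.out.println("lines per batch = " + BATCH_SIZE);
    }
}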

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import preprocess.ImportDataBase;

public class MuiltThreadImportDB {

    /**
     * Reads a large CSV file and hands each block of lines to a new
     * worker thread, which stores the block in the database.
     */
    private static int m_record = 99999; // lines per batch minus one (batches of 100,000)
    private static BufferedReader br = null;
    private ArrayList<String> list;
    private static int m_thread = 0; // running counter, used to name the worker threads

    static {
        try {
            br = new BufferedReader(
                    new FileReader("E:/tianci/IJCAI15 Data/data_format1/user_log_format1.csv"),
                    8192);
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
        try {
            br.readLine(); // skip the CSV header line
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public void start() {
        String line;
        int count = 0;
        list = new ArrayList<String>(m_record + 1);
        synchronized (br) {
            try {
                while ((line = br.readLine()) != null) {
                    if (count < m_record) {
                        list.add(line);
                        count++;
                    } else {
                        // the batch is full: add the last line, hand the batch to
                        // a worker thread, and start collecting a fresh batch
                        list.add(line);
                        count = 0;
                        Thread t1 = new Thread(new MultiThread(list),
                                Integer.toString(m_thread++));
                        t1.start();
                        list = new ArrayList<String>(m_record + 1);
                    }
                }
                // flush the last, possibly partial batch
                if (!list.isEmpty()) {
                    Thread t1 = new Thread(new MultiThread(list),
                            Integer.toString(m_thread++));
                    t1.start();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    public static void main(String[] args) {
        new MuiltThreadImportDB().start();
    }
}
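One thing to watch in the reader above: it starts an unbounded number of threads, one per batch, so if reading outruns the inserts, the pending batches accumulate on the heap. A bounded thread pool with a small queue throttles the reader instead. The sketch below is a suggestion of mine, not part of the original program; the pool and queue sizes are illustrative.

import java.util.ArrayList;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PooledImport {
    public static void main(String[] args) throws InterruptedException {
        // four insert workers; at most eight batches may wait in the queue, and
        // when it is full CallerRunsPolicy makes the submitting (reader) thread
        // run the task itself, which throttles reading to the insert speed
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                4, 4, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<Runnable>(8),
                new ThreadPoolExecutor.CallerRunsPolicy());

        // in the real program this loop is the file reader: each task is a
        // MultiThread (shown below) holding one batch of CSV lines
        for (int batch = 0; batch < 10; batch++) {
            ArrayList<String> lines = new ArrayList<String>(); // filled from the file
            pool.execute(new MultiThread(lines));
        }

        pool.shutdown(); // no more batches
        pool.awaitTermination(1, TimeUnit.HOURS); // wait for the inserts to finish
    }
}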

Step 2: insert the data in batches from the worker threads

class MultiThread implements Runnable {
    private ArrayList<String> list;

    public MultiThread(ArrayList<String> list) {
        this.list = list;
    }

    public void run() {
        try {
            // hand this batch of CSV lines to the database import helper
            ImportDataBase insert = new ImportDataBase(list);
            insert.start();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
        display(this.list);
    }

    // report which thread handled the batch and how many lines it contained
    public void display(List<String> list) {
        System.out.print(Thread.currentThread().getName() + " :");
        System.out.println(list.size());
    }
}
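ImportDataBase is not shown in full here. For the batch method below to work, its connection setup must at least disable autocommit, since the method commits manually. A minimal sketch follows; the URL, database name, and credentials are placeholders, not the actual ones used. Connector/J's rewriteBatchedStatements=true is also worth enabling: it rewrites a JDBC batch into multi-row INSERT statements and usually multiplies batch-insert throughput.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class ConnectionSetup {
    // minimal sketch of the connection setup ImportDataBase needs;
    // the host, database name, and credentials below are placeholders
    public static Connection open() throws SQLException {
        String url = "jdbc:mysql://localhost:3306/tianchi"
                + "?rewriteBatchedStatements=true"; // rewrite batches into multi-row INSERTs
        Connection conn = DriverManager.getConnection(url, "user", "password");
        conn.setAutoCommit(false); // insertUserBehaviour() commits manually
        return conn;
    }
}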

For the batch operation, MySQL's PreparedStatement class is used. The plain Statement class also supports batching, but its performance is clearly worse: the former reaches an insert speed of 10,000+ rows per second, while the latter only manages 2,000+.

public int insertUserBehaviour(ArrayList<String> sqls) throws SQLException {
    String sql = "insert into user_behaviour_log (user_id,item_id,cat_id,merchant_id,brand_id,time_stamp,action_type)"
            + " values(?,?,?,?,?,?,?)";
    preStmt = conn.prepareStatement(sql);
    for (int i = 0; i < sqls.size(); i++) {
        UserLog log = new UserLog(sqls.get(i));
        preStmt.setString(1, log.getUser_id());
        preStmt.setString(2, log.getItem_id());
        preStmt.setString(3, log.getCat_id());
        preStmt.setString(4, log.getMerchant_id());
        preStmt.setString(5, log.getBrand_id());
        preStmt.setString(6, log.getTimeStamp());
        preStmt.setString(7, log.getActionType());
        preStmt.addBatch();
        // execute and commit every 10,000 rows so neither the batch nor the
        // transaction grows without bound (autocommit must be disabled)
        if ((i + 1) % 10000 == 0) {
            preStmt.executeBatch();
            conn.commit();
            preStmt.clearBatch();
        }
    }
    // flush the remaining rows of the final, partial batch
    preStmt.executeBatch();
    conn.commit();
    return 1;
}
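For comparison, the slower Statement-based variant mentioned above looks roughly like this. This is a sketch, not the exact code used in the experiment: it concatenates each INSERT as a string instead of binding parameters, which is what makes it slower, and unsafe unless every value is escaped.

// sketch of the Statement-based batch, for comparison only
Statement stmt = conn.createStatement();
for (int i = 0; i < sqls.size(); i++) {
    UserLog log = new UserLog(sqls.get(i));
    stmt.addBatch("insert into user_behaviour_log"
            + " (user_id,item_id,cat_id,merchant_id,brand_id,time_stamp,action_type)"
            + " values('" + log.getUser_id() + "','" + log.getItem_id() + "','"
            + log.getCat_id() + "','" + log.getMerchant_id() + "','"
            + log.getBrand_id() + "','" + log.getTimeStamp() + "','"
            + log.getActionType() + "')");
    if ((i + 1) % 10000 == 0) {
        stmt.executeBatch();
        conn.commit();
        stmt.clearBatch();
    }
}
stmt.executeBatch();
conn.commit();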

Of course, I also experimented with the different MySQL storage engines, InnoDB and MyISAM. The results showed InnoDB to be faster (about 3 times), which may be related to the newer MySQL version; my MySQL version is 5.6.

Finally, let’s summarize the methods for improving insert speed with large amounts of data.

For Java code, use multi-threaded insertion and batch submission.

On the database side, create the table structure without secondary indexes, otherwise the index's B+ tree has to be maintained throughout the insertion process; add the indexes after the load instead (see the sketch below). Also consider the storage engine: InnoDB is the default in newer MySQL versions and can simply be kept, while older versions may need it changed.
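Concretely, the load-then-index pattern can be sketched like this. Only the table and column names come from the insert code above; the column types and the index name are assumptions for illustration.

// sketch: create the table without secondary indexes, bulk load, then
// build the index once at the end instead of maintaining the B+ tree
// on every insert; the column types here are assumptions
Statement ddl = conn.createStatement();
ddl.execute("create table if not exists user_behaviour_log ("
        + " user_id varchar(16), item_id varchar(16), cat_id varchar(16),"
        + " merchant_id varchar(16), brand_id varchar(16),"
        + " time_stamp varchar(8), action_type varchar(2)"
        + ") engine=InnoDB");

// ... run the multi-threaded batch insert here ...

ddl.execute("alter table user_behaviour_log add index idx_user (user_id)");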

The above is the full content of this article. I hope it will be helpful for everyone’s study. I also hope that everyone will support 123WORDPRESS.COM.
