In data analysis, the database is an indispensable helper: it not only speeds up our queries, but also lets us run further analysis on the stored data. First, though, the data has to get into the database. In practice we often face data volumes in the tens of millions of rows or more, and without a fast insertion method the import alone wastes an enormous amount of time.

I ran into exactly this problem in Alibaba's Tianchi big-data algorithm competition (popular music trend prediction). Before I optimized my database queries and inserts, I wasted a lot of time: inserting 15 million rows with the most basic row-by-row method took an incredible 12 hours. That pushed me to think about how to optimize insertion and query operations, and after several rounds of optimization the gains were substantial. Querying and aggregating the download, play, and favorite counts of more than 26,000 songs as time series dropped from an estimated 40 hours to just over one hour, and when tested on a new data set, more than 54.9 million rows were inserted in 20 minutes. Below I share my approach.

The optimization has two steps. Step one: read the data from the CSV file with a buffered reader and, once a certain number of lines has accumulated, hand that chunk to a new thread that inserts it into the database. Step two: use MySQL batch insert operations.

Step 1: read the file and start the insert threads

How large each chunk should be is worth considering. In my experiments I first tried 1,000,000 lines, but that caused a Java heap overflow, so I finally settled on 100,000.
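As a variant of this chunk-and-spawn pattern, the same logic can be sketched with a bounded thread pool, which avoids creating one raw thread per chunk when the file is very large. This is a simplified, self-contained sketch using in-memory data; the class `ChunkedDispatch` and its names are illustrative, not from the original program.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ChunkedDispatch {
    // Chunk size: 100,000 lines per batch; 1,000,000-line chunks
    // overflowed the Java heap in the experiments described above.
    static final int CHUNK_SIZE = 100_000;

    // Split the input into fixed-size chunks; the last chunk may be smaller.
    static List<List<String>> chunk(List<String> lines, int size) {
        List<List<String>> chunks = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += size) {
            chunks.add(new ArrayList<>(
                    lines.subList(i, Math.min(i + size, lines.size()))));
        }
        return chunks;
    }

    // Hand each chunk to a bounded pool instead of spawning unbounded threads.
    static void dispatch(List<String> lines, int chunkSize, int threads)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (List<String> c : chunk(lines, chunkSize)) {
            pool.submit(() -> {
                // In the real program this is where the chunk would be
                // batch-inserted; here we only report its size.
                System.out.println(Thread.currentThread().getName() + " :" + c.size());
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> demo = new ArrayList<>();
        for (int i = 0; i < 250_003; i++) demo.add("row" + i);
        dispatch(demo, CHUNK_SIZE, 4); // prints each worker's chunk size
    }
}
```

A fixed pool also caps how many chunks are in flight at once, which bounds memory use in the same way the 100,000-line chunk size does.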
Of course, other chunk sizes work too; pick what suits your setup.

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import preprocess.ImportDataBase;

public class MuiltThreadImportDB {
    /**
     * Java multi-threaded reading of a large file and storage
     */
    private static int m_record = 99999;
    private static BufferedReader br = null;
    private ArrayList<String> list;
    private static int m_thread = 0;

    static {
        try {
            br = new BufferedReader(
                    new FileReader("E:/tianci/IJCAI15 Data/data_format1/user_log_format1.csv"), 8192);
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
        try {
            br.readLine(); // skip the CSV header
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public void start() {
        String line;
        int count = 0;
        list = new ArrayList<String>(m_record + 1);
        synchronized (br) {
            try {
                while ((line = br.readLine()) != null) {
                    if (count < m_record) {
                        list.add(line);
                        count++;
                    } else {
                        list.add(line);
                        count = 0;
                        Thread t1 = new Thread(new MultiThread(list), Integer.toString(m_thread++));
                        t1.start();
                        list = new ArrayList<String>(m_record + 1);
                    }
                }
                if (!list.isEmpty()) { // flush the last, possibly partial, chunk
                    Thread t1 = new Thread(new MultiThread(list), Integer.toString(m_thread++));
                    t1.start();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    public static void main(String[] args) {
        new MuiltThreadImportDB().start();
    }
}

Step 2: insert the data in batches from multiple threads

class MultiThread implements Runnable {
    private ArrayList<String> list;

    public MultiThread(ArrayList<String> list) {
        this.list = list;
    }

    public void run() {
        try {
            ImportDataBase insert = new ImportDataBase(list);
            insert.start();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
        display(this.list);
    }

    public void display(List<String> list) {
        // for (String str : list) {
        //     System.out.println(str);
        // }
        System.out.print(Thread.currentThread().getName() + " :");
        System.out.println(list.size());
    }
}

For the batch operation I used MySQL with JDBC's PreparedStatement. I also tried batching with the plain Statement class, but its performance was worse: PreparedStatement reached an insertion speed of 10,000+ rows per second, while Statement managed only about 2,000.

public int insertUserBehaviour(ArrayList<String> sqls) throws SQLException {
    String sql = "insert into user_behaviour_log (user_id,item_id,cat_id,merchant_id,brand_id,time_stamp,action_type)"
            + " values(?,?,?,?,?,?,?)";
    preStmt = conn.prepareStatement(sql);
    for (int i = 0; i < sqls.size(); i++) {
        UserLog log = new UserLog(sqls.get(i));
        preStmt.setString(1, log.getUser_id());
        preStmt.setString(2, log.getItem_id());
        preStmt.setString(3, log.getCat_id());
        preStmt.setString(4, log.getMerchant_id());
        preStmt.setString(5, log.getBrand_id());
        preStmt.setString(6, log.getTimeStamp());
        preStmt.setString(7, log.getActionType());
        preStmt.addBatch();
        if ((i + 1) % 10000 == 0) { // flush and commit every 10,000 rows
            preStmt.executeBatch();
            conn.commit();
            preStmt.clearBatch();
        }
    }
    preStmt.executeBatch(); // flush any remaining rows
    conn.commit();
    return 1;
}

We also experimented with different MySQL storage engines, InnoDB and MyISAM. In our tests InnoDB was faster (roughly 3x), which may be related to the newer MySQL version; my MySQL version is 5.6.

Finally, to summarize how to improve insertion speed for large amounts of data: on the Java side, use multi-threaded insertion and batch commits. On the database side, create the table without indexes, since every index's B+ tree would otherwise have to be maintained during the insert; add the indexes after the bulk load finishes. Also choose a suitable storage engine: in recent MySQL versions InnoDB is the default and works well, while older versions may need it set explicitly.
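The listing above uses a `conn` field whose setup is not shown. A minimal sketch of how such a connection might be configured follows; the host, database name, and the `ConnectionFactory` class are hypothetical, but `rewriteBatchedStatements=true` is a real MySQL Connector/J URL parameter worth knowing: without it, `executeBatch()` still sends one INSERT per row over the wire, which largely defeats the batching.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class ConnectionFactory {
    // Hypothetical host and database, for illustration only.
    static final String HOST = "localhost";
    static final String DB = "tianchi";

    // Build a JDBC URL that enables Connector/J's client-side batch
    // rewriting, so a batch is sent as one multi-row INSERT statement.
    static String batchFriendlyUrl(String host, String db) {
        return "jdbc:mysql://" + host + ":3306/" + db
                + "?rewriteBatchedStatements=true";
    }

    static Connection open(String user, String password) throws SQLException {
        Connection conn = DriverManager.getConnection(
                batchFriendlyUrl(HOST, DB), user, password);
        // Disable autocommit so the explicit conn.commit() calls around
        // each 10,000-row batch actually group the work into transactions.
        conn.setAutoCommit(false);
        return conn;
    }

    public static void main(String[] args) {
        System.out.println(batchFriendlyUrl(HOST, DB));
    }
}
```

Committing in explicit 10,000-row transactions, as the insert method above does, also avoids the per-row fsync cost that autocommit would otherwise impose on InnoDB.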