In projects we often need to import large amounts of data into a database so that it can be analyzed with SQL, and the import itself raises a number of problems that have to be solved. Here, based on the experience of importing a txt file of about 4 GB, I describe the problems I ran into and how I solved them, partly as a record for myself and partly as a reference for anyone who hits the same issues.

The data I imported is an encyclopedia dump in a txt file a little over 4 GB in size, containing more than 65 million records, one record per line. Each record has three fields separated by tabs. My approach is to read each record into a TripleData class that stores the three fields as Strings, collect the records into a List<TripleData>, and then write the list to the MySQL database in batches until everything is stored. That is the general idea; the following are the problems encountered during the actual import.

1. Database connection garbled characters and compatibility issues

If the data contains Chinese, be sure to set the encoding parameters in the URL used to connect to the database. The URL should be set as follows:

URL="jdbc:mysql://"+IP+":"+PORT+"/"+DB_NAME+"?useSSL=false&useUnicode=true&characterEncoding=utf-8";

Setting the encoding to UTF-8 solves the garbled character problem, and setting useSSL resolves a compatibility issue between the JDBC driver and MySQL. If useSSL is not set, an error similar to the following is reported:

Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.

The main reason is that the MySQL server version is relatively new while the JDBC driver is relatively old, so the option has to be set explicitly for compatibility. (A minimal connection sketch using this URL is given at the end of this post.)

2. utf8mb4 encoding problem

During the import you may encounter an exception similar to:

SQLException: Incorrect string value: '\xF0\xA1\x8B\xBE\xE5\xA2...' for column 'name'

This error occurs because MySQL's utf8 character set stores at most 3 bytes per character. That is fine for ordinary data, but a large data set will inevitably contain emoji or other special characters that need 4 bytes, which utf8 cannot handle, so the insert fails. The solution is utf8mb4, the 4-byte UTF-8 encoding that MySQL introduced in version 5.5.3; the MySQL encoding needs to be reset accordingly. You can follow the steps below.

First, back up the database to be modified. Although utf8mb4 is backward compatible with utf8, it is still worth taking precautions against a mistake.

Second, change the character set of the database to utf8mb4 (UTF-8 Unicode) and the collation to utf8mb4_general_ci. I made these changes with Navicat; you can look up how to do the same from the command line. (A sketch of the equivalent SQL statements, run through JDBC, is given at the end of this post.)

Third, modify the configuration file my.ini in the root directory of the MySQL installation and add the following settings:
[client]
default-character-set = utf8mb4

[mysqld]
character-set-server = utf8mb4
collation-server = utf8mb4_general_ci

[mysql]
default-character-set = utf8mb4

After making these changes, restart MySQL for them to take effect. The data should then import normally.

3. Time efficiency issues for a large-scale import

Because the data volume is large, I split the data: the 65 million records were divided into 500 files, each holding about 110,000 records. The records of each file are loaded into an ArrayList<TripleObject> and then imported as one batch. The general idea is to use the multi-row form "insert into tb (...) values (...),(...)...;" so that a single insert statement writes many rows at once, which saves a great deal of time. An example method is as follows; a sketch of how it can be driven for one of the split files is given at the end of this post.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.List;

public static void insertSQL(String sql, List<TripleObject> tripleObjectList) throws SQLException {
    Connection conn = null;
    Statement stmt = null;
    try {
        conn = DriverManager.getConnection(Common.URL, Common.DB_USERNAME, Common.DB_PASSWORD);
        conn.setAutoCommit(false); // commit manually once the whole batch has been sent
        // Accumulate the "(...),(...)" value list that is appended to the insert prefix
        StringBuilder suffix = new StringBuilder();
        stmt = conn.createStatement();
        for (TripleObject t : tripleObjectList) {
            // Strip characters that would break the hand-built SQL string
            String s = t.getSubject().replaceAll(",", ".").replaceAll("\\(", "").replaceAll("\\)", "").replaceAll("'", "").replaceAll("\\\\", "");
            String p = t.getPredicate().replaceAll(",", ".").replaceAll("\\(", "").replaceAll("\\)", "").replaceAll("'", "").replaceAll("\\\\", "");
            String o = t.getObject().replaceAll(",", ".").replaceAll("\\(", "").replaceAll("\\)", "").replaceAll("'", "").replaceAll("\\\\", "");
            suffix.append("('").append(s).append("','").append(p).append("','").append(o).append("'),");
        }
        if (suffix.length() == 0) {
            return; // nothing to insert
        }
        // Build the complete multi-row insert, dropping the trailing comma
        String allsql = sql + suffix.substring(0, suffix.length() - 1);
        stmt.addBatch(allsql);
        stmt.executeBatch(); // execute the batch
        conn.commit();       // commit the transaction
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        if (stmt != null) {
            stmt.close();
        }
        if (conn != null) {
            conn.close();
        }
    }
}

The advantage of this method is that the import is very fast: it took just one hour to import all 65 million records. The disadvantage is that long text values have to be cleaned first, because the commas, brackets, quotes and backslashes in them would otherwise break the hand-built SQL, so you need to weigh whether that trade-off is acceptable. If you insert the data in the normal way, that is, as "insert into tb (...) values (...); insert into tb (...) values (...); ...", there is no need to deal with special characters, but it takes far longer: in my test about 12 minutes per 110,000 records, or roughly 100 hours for all 65 million. We used the first method because our data only needs a rough level of accuracy and there are no strict requirements on individual values, so the cleaning is acceptable and the time saving is worth it.

The above are the problems I encountered when importing a large amount of data into MySQL and the solutions I came up with. If you have a better approach, or run into other problems, I would be glad to discuss them.
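As mentioned in section 1, here is a minimal sketch of opening a connection with the encoding and SSL parameters set. It is only an illustration under assumptions: the host, port, database name and credentials are placeholders and need to be replaced with your own values.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class ConnectionDemo {
    public static void main(String[] args) throws SQLException {
        // Placeholder connection details; replace with your own values
        String ip = "127.0.0.1";
        String port = "3306";
        String dbName = "test";
        // useSSL=false avoids the SSL warning on newer MySQL servers;
        // useUnicode + characterEncoding=utf-8 prevents garbled Chinese text
        String url = "jdbc:mysql://" + ip + ":" + port + "/" + dbName
                + "?useSSL=false&useUnicode=true&characterEncoding=utf-8";
        try (Connection conn = DriverManager.getConnection(url, "root", "password")) {
            System.out.println("Connected: " + !conn.isClosed());
        }
    }
}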
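The second step of section 2 can also be expressed as plain SQL instead of going through Navicat. The following is a hedged sketch that runs the ALTER statements over JDBC; the database name mydb, the table name tb and the connection details are assumptions, not names from the original project.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class Utf8mb4Migration {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details; "mydb" and "tb" are assumed names
        String url = "jdbc:mysql://127.0.0.1:3306/mydb?useSSL=false&useUnicode=true&characterEncoding=utf-8";
        try (Connection conn = DriverManager.getConnection(url, "root", "password");
             Statement stmt = conn.createStatement()) {
            // Change the default character set and collation of the database to 4-byte UTF-8
            stmt.executeUpdate("ALTER DATABASE mydb CHARACTER SET = utf8mb4 COLLATE = utf8mb4_general_ci");
            // Convert an existing table and its text columns to utf8mb4
            stmt.executeUpdate("ALTER TABLE tb CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci");
        }
    }
}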
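Finally, here is a hedged sketch of how the insertSQL method from section 3 might be driven for one of the 500 split files. The file path, table name and column names are assumptions, TripleObject is assumed to expose a three-argument constructor, and insertSQL is assumed to be accessible from this class.

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class ImportDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder path to one of the split files; each line is "subject<TAB>predicate<TAB>object"
        String chunkFile = "data/part_001.txt";
        List<TripleObject> list = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(chunkFile), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split("\t");
                if (fields.length == 3) {
                    // TripleObject is assumed to take subject, predicate and object in its constructor
                    list.add(new TripleObject(fields[0], fields[1], fields[2]));
                }
            }
        }
        // Table and column names are placeholders; insertSQL appends the "(...),(...)" value list to this prefix
        String prefix = "insert into tb (subject, predicate, object) values ";
        insertSQL(prefix, list); // the batch method shown in section 3
    }
}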