Analysis of the problems and solutions encountered in importing large amounts of data into MySQL

In projects, we often need to import large amounts of data into a database so that SQL can be used for analysis. The import process raises several problems that have to be solved. Here, drawing on the practice of importing a txt file of about 4 GB, I present the problems I encountered and their solutions: partly as a record for myself, and partly as a reference for anyone who runs into the same issues.

The data I imported is an encyclopedia dump in a txt file of more than 4 GB, containing more than 65 million records, one per line. Each record has three fields separated by tabs. My approach is to read each record into a TripleData class that stores the three fields as Strings, collect the records into a List<TripleData>, and then write the lists to the MySQL database in batches.
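For reference, here is a minimal sketch of the container class described above. The getter names match the code used later in this article; the field names and the parsing helper are my own assumptions, since the article only states that the class holds three String fields separated by tabs.

public class TripleData {
    private final String subject;
    private final String predicate;
    private final String object;

    public TripleData(String subject, String predicate, String object) {
        this.subject = subject;
        this.predicate = predicate;
        this.object = object;
    }

    public String getSubject() { return subject; }
    public String getPredicate() { return predicate; }
    public String getObject() { return object; }

    // Parse one record: three tab-separated fields, one record per line
    public static TripleData fromLine(String line) {
        String[] parts = line.split("\t", 3);
        return new TripleData(parts[0], parts[1], parts[2]);
    }
}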

The above is a general idea. The following are the problems encountered during the specific import process.

1. Garbled characters and compatibility issues with the database connection

If the data contains Chinese text, be sure to set the encoding parameters in the JDBC URL used to connect to the database. The URL should be set as follows.

URL="jdbc:mysql://"+IP+":"+PORT+"/"+DB_NAME+"?useSSL=false&useUnicode=true&characterEncoding=utf-8";

Setting the encoding to UTF-8 solves the garbled character problem, and setting useSSL=false avoids a compatibility problem between the JDBC driver and MySQL. If useSSL is not set, a warning similar to the following is reported:

Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.

The main reason is that the MySQL server version is relatively new while the JDBC driver version is relatively old, so the option must be set explicitly for compatibility.
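For reference, a minimal sketch of how the Common constants used throughout this article might be defined; the host, port, database name, and credentials are placeholders of my own, while the URL parameters are the ones given above.

import java.sql.Connection;
import java.sql.DriverManager;

public class Common {
    public static final String IP = "127.0.0.1";          // placeholder host
    public static final String PORT = "3306";             // default MySQL port
    public static final String DB_NAME = "mydb";          // placeholder database name
    public static final String DB_USERNAME = "root";      // placeholder credentials
    public static final String DB_PASSWORD = "password";  // placeholder credentials

    public static final String URL = "jdbc:mysql://" + IP + ":" + PORT + "/" + DB_NAME
            + "?useSSL=false&useUnicode=true&characterEncoding=utf-8";

    public static Connection getConnection() throws Exception {
        return DriverManager.getConnection(URL, DB_USERNAME, DB_PASSWORD);
    }
}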

2. The utf8mb4 encoding problem

During the import, you may encounter an error message similar to:

SQLException : Incorrect string value: '\xF0\xA1\x8B\xBE\xE5\xA2...' for column 'name'

This error occurs because MySQL's utf8 character set uses at most 3 bytes per character, which is fine for most data. A large dataset, however, will inevitably contain emoji such as WeChat emoticons or other special characters that occupy 4 bytes, which utf8 cannot store, so the insert fails. The solution is utf8mb4, the true 4-byte UTF-8 encoding that MySQL introduced in version 5.5.3; the MySQL encoding needs to be reset to use it.

You can follow the steps below. First, back up the database you are going to modify; although utf8mb4 is backward compatible with utf8, it is still wise to guard against mistakes by keeping a backup. Second, change the database's character set to utf8mb4 (UTF-8 Unicode) and its collation to utf8mb4_general_ci; I made these changes in Navicat, and they can also be made from the command line (a JDBC sketch of the equivalent SQL follows the configuration below). Third, edit the my.ini configuration file in the MySQL installation root directory and add the following settings.

[client]
default-character-set = utf8mb4
[mysqld]
character-set-server=utf8mb4
collation-server=utf8mb4_general_ci
[mysql]
default-character-set = utf8mb4

After the modification is completed, you need to restart MySQL for the modification to take effect.
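As mentioned above, the character-set change can also be applied with SQL instead of Navicat. A minimal sketch executed over JDBC, where the database name mydb and table name tb are hypothetical names for illustration:

import java.sql.Connection;
import java.sql.Statement;

public class Utf8mb4Migration {
    public static void main(String[] args) throws Exception {
        // "mydb" and "tb" are hypothetical names used for illustration
        try (Connection conn = Common.getConnection();
             Statement st = conn.createStatement()) {
            st.executeUpdate("ALTER DATABASE mydb CHARACTER SET = utf8mb4 COLLATE = utf8mb4_general_ci");
            st.executeUpdate("ALTER TABLE tb CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci");
        }
    }
}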

Then import the data again; it should now go through normally.

3. Time efficiency of the large-scale import

Since our data volume is relatively large, I segmented the data: the 65 million records were divided into 500 files, each with about 110,000 records. I load each file's records into an ArrayList<TripleData> and import the list in one batch. The basic idea is to use a multi-row statement of the form "insert into tb (...) values (...), (...), ...;" so that a single insert writes the whole batch, which saves a great deal of time. An example method is as follows.

public static void insertSQL(String sql, List<TripleData> tripleDataList) throws SQLException {
    Connection conn = null;
    Statement stmt = null;
    try {
      conn = DriverManager.getConnection(Common.URL, Common.DB_USERNAME, Common.DB_PASSWORD);
      conn.setAutoCommit(false); // commit manually, once for the whole batch
      stmt = conn.createStatement();
      StringBuffer suffix = new StringBuffer(); // collects the "(...),(...)" value list
      int count = 0;
      String s = "";
      String p = "";
      String o = "";
      while (count < tripleDataList.size()) {
        // Strip characters that would break the hand-built SQL string
        s = tripleDataList.get(count).getSubject().replaceAll(",", ".").replaceAll("\\(", "").replaceAll("\\)", "").replaceAll("'", "").replaceAll("\\\\", "");
        p = tripleDataList.get(count).getPredicate().replaceAll(",", ".").replaceAll("\\(", "").replaceAll("\\)", "").replaceAll("'", "").replaceAll("\\\\", "");
        o = tripleDataList.get(count).getObject().replaceAll(",", ".").replaceAll("\\(", "").replaceAll("\\)", "").replaceAll("'", "").replaceAll("\\\\", "");
        suffix.append("('" + s + "','" + p + "','" + o + "'),");
        count++;
      }
      // Build the complete SQL statement, dropping the trailing comma
      String allsql = sql + suffix.substring(0, suffix.length() - 1);
      stmt.addBatch(allsql);
      stmt.executeBatch(); // execute the batch
      conn.commit();       // commit the transaction
    } catch (Exception e) {
      e.printStackTrace();
    } finally {
      if (stmt != null) {
        stmt.close();
      }
      if (conn != null) {
        conn.close();
      }
    }
}

The advantage of this method is that it takes very little time: importing the 65 million records took a full hour. The disadvantage is that long sentences in the data must have their commas, brackets, backslashes, and so on stripped first, so you need to weigh whether this method is acceptable for your data.
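For illustration, a minimal driver sketch for importing one of the 500 segment files this way. The file name, table name tb, and column names are hypothetical, and insertSQL is assumed to be the method above, in scope (e.g., in the same class):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class ImportDriver {
    public static void main(String[] args) throws Exception {
        // "part_000.txt" is a hypothetical name for one of the 500 segment files
        List<TripleData> batch = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream("part_000.txt"), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                batch.add(TripleData.fromLine(line));
            }
        }
        // Prefix for the multi-row insert built by insertSQL; column names are assumed
        insertSQL("insert into tb (subject, predicate, object) values ", batch);
    }
}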

If instead you insert rows the normal way, one statement per record in the form "insert into tb (...) values (...); insert into tb (...) values (...); ...", there is no need to handle special symbols, but it takes far longer. In my test, importing 110,000 records took about 12 minutes, which works out to roughly 100 hours for all 65 million.

We used the first method, which only requires the data to be roughly cleaned; since we had no strict requirements on preserving the data exactly, it saved a great deal of time.
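For completeness, a parameterized batch insert is a middle ground that I did not use in this project: placeholders make the special-character stripping unnecessary, and MySQL Connector/J can coalesce the batch into multi-row inserts when rewriteBatchedStatements=true is added to the URL. A minimal sketch, with the table and column names again hypothetical:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;

public class ParameterizedInsert {
    public static void insertBatch(List<TripleData> rows) throws Exception {
        // rewriteBatchedStatements=true lets the driver rewrite the batch
        // into multi-row inserts; table/column names are hypothetical
        try (Connection conn = DriverManager.getConnection(
                 Common.URL + "&rewriteBatchedStatements=true",
                 Common.DB_USERNAME, Common.DB_PASSWORD);
             PreparedStatement ps = conn.prepareStatement(
                 "insert into tb (subject, predicate, object) values (?, ?, ?)")) {
            conn.setAutoCommit(false);
            for (TripleData t : rows) {
                ps.setString(1, t.getSubject());   // placeholders handle quotes,
                ps.setString(2, t.getPredicate()); // commas, and backslashes safely
                ps.setString(3, t.getObject());
                ps.addBatch();
            }
            ps.executeBatch();
            conn.commit();
        }
    }
}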

The above are the problems I encountered when importing large amounts of data into MySQL, together with the solutions I came up with. If you have a better approach or run into other problems, I would be glad to discuss them together.
