How to support full Unicode in MySQL/MariaDB

How to support full Unicode in MySQL/MariaDB

Never use utf8 in MySQL, and always use utf8mb4.

Introduction to utf8mb4

In MySQL/MariaDB, the utf8 character set is not a true implementation of Unicode, that is, it is not a true UTF-8 encoding because it supports characters of up to 3 bytes.
As a variable-length character encoding, UTF-8 supports up to 4 bytes. For some characters such as emoji, Chinese, Japanese, and Korean traditional characters, 4 bytes are required to fully store them. In other words, if you want to support all Unicode encodings, utf8 in MySQL/MariaDB is not enough.

The UTF-8 encoding can represent every symbol in the Unicode character set, ranging from U+000000 to U+10FFFF. That's 1,114,112 possible symbols. (Not all of these Unicode code points are assigned characters, but that doesn't stop UTF-8 from encoding them. Unicode actually maps characters to 1,112,064 code points in total.) UTF-8 is a variable-width encoding; it uses one to four (8-bit) bytes to encode each symbol. This saves space for storing low-code values.

And it can fully represent all characters.

UTF-8 is arguably the most popular encoding implementation of Unicode.

The real UTF-8 encoding implemented in MySQL/MariaDB is utf8mb4, which supports up to 4 bytes of storage.

There are a total of 1,048,576 possible code points that cannot be used. MySQL's utf8 only allows storage of 5.88% of all possible Unicode code points ((0x00FFFF + 1) / (0x10FFFF + 1)). Correct UTF-8 can encode 100% of all Unicode code points.

UTF8 byte count exceeded error

If utf8 is used, when storing 4-byte characters such as emoji, an error Error: 1366 is usually reported:

[Err] 1366 - Incorrect string value: '\xF0\x9F\x98\x93' for column 'xxx' at row xx

[Error] 1366 - Incorrect string value: '\x...' for column 'xxx' at row xx

utf8mb4 support

The utf8mb4 in MySQL/MariaDB is a supplement to the original utf8 that can only store characters of 3 bytes in size. It is a true UTF-8 encoding.

Supported since MySQL 5.5.3+.

Starting from MySQL 8.0, the default character set has changed to utf8mb4. That is, if you use MySQL 8.0 and above, you don't have to worry about the byte length issue.
The default character set for MariaDB 10.x is still latin1. Therefore, usually, it should be changed to utf8mb4.

The commonly used sorting rules corresponding to utf8mb4 are: utf8mb4_general_ci and utf8mb4_unicode_ci.

Set the default character encoding to utf8mb4 and the corresponding sorting rules.

View the current code

After logging in to MySQL/MariaDB, use SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name LIKE 'collation%'; to view the character set and collation.

View the character set:

MariaDB [(none)]> show variables like 'character%';
+--------------------------+------------------------------------------------+
| Variable_name | Value |
+--------------------------+------------------------------------------------+
| character_set_client | gbk |
| character_set_connection | gbk |
| character_set_database | utf8 |
| character_set_filesystem | binary |
| character_set_results | gbk |
| character_set_server | utf8 |
| character_set_system | utf8 |
| character_sets_dir | C:\Program Files\MariaDB 10.3n\share\charsets\ |
+--------------------------+------------------------------------------------+
8 rows in set (0.043 sec)

Since it is installed under Chinese Windows, it contains gbk.

Modify the MySQL/Mariadb configuration file and change the utf8 encoding to utf8mb4

  • In Linux, modify the /etc/my.cnf or /etc/my.cnf.d/server.cnf file and add the following content under [mysql], [mysqld], etc.:
  • In Windows, modify the data/my.ini file in the MySQL/Mariadb installation directory and add content under [mysql], [mysqld], etc.

my.cnf under Linux:

> vim /etc/my.cnf
#
# This group is read both by the client and the server
# use it for options that affect everything
#
[client-server]

[mysqld]
character-set-server = utf8mb4
init_connect = 'SET NAMES utf8mb4'
collation-server=utf8mb4_unicode_ci
character-set-client-handshake=FALSE

[client]
default-character-set=utf8mb4

[mysql]
default-character-set=utf8mb4
#
# include all files from the config directory
#
!includedir /etc/my.cnf.d

my.ini under Windows:

[mysqld]
datadir=C:/Program Files/MariaDB 10.3n/data
port=3306
character_set_server=utf8mb4
# Using unique option prefix 'character_set_client' is error-prone and can break in the future. Please use the full name 'character-set-client-handshake' instead.
character-set-client-handshake=utf8mb4  
# character_set_client=utf8mb4
collation-server=utf8mb4_unicode_ci
init_connect = 'SET NAMES utf8mb4'
skip-character-set-client-handshake=false
character_set_filesystem = binary
lower_case_table_names=2 

[mysqldump]
loose_character_set_client=utf8mb4

[mysql]
default-character-set=utf8mb4

[client]
port=3306
plugin-dir=C:/Program Files/MariaDB 10.3n/lib/plugin
default-character-set=utf8mb4

Several setting items are described:

  • character-set-client-handshake=FALSE can affect the result of collation_connection to utf8mb4_unicode_ci instead of utf8mb4_general_ci. Of course, using SET collation_connection = utf8mb4_unicode_ci or collation_connection = utf8mb4_unicode_ci also works.
  • SET NAMES indicates the character set used by the client connection, that is, the character set used to send SQL statements to the server. character-set-server Set the server character set. To use utf8mb4 correctly, you need to make sure your client, server, and connection are all set to utf8mb4.
  • All other default settings related to character sets, such as init_connect, are inherited from character-set-server, which means that it is not necessary to specify init-connect, character_set_client, character_set_results, character_set_connection, etc. separately. Therefore, init_connect in the above settings can be omitted.
  • The encoding value for the connection enables MySQL to correctly decode input and encode results. Otherwise it will be re-encoded internally.

The pure and concise settings for the utf8mb4 character set and sorting rules are as follows, and only 5 settings are required:

[client]
default-character-set = utf8mb4

[mysql]
default-character-set = utf8mb4

[mysqld]
character-set-client-handshake = FALSE
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci

Restart MySQL/MariaDB

In Linux, use systemctl restart mariadb or systemctl restart mysqld. Must use restart (not reload).

The service started or restarted by earlier versions of MariaDB is named mysqld (mysqld.service). systemctl restart mysqld

Open "Services" in Windows, find the corresponding service of MariaDB or MySQL in the services, right-click and restart it.

Check the character set and collation again

After the above changes take effect after restart, check again, as shown below, it has become utf8mb4 and utf8mb4_unicode_ci.

MariaDB [(none)]> SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name LIKE 'collation%';
+--------------------------+--------------------+
| Variable_name | Value |
+--------------------------+--------------------+
| character_set_client | utf8mb4 |
| character_set_connection | utf8mb4 |
| character_set_database | utf8mb4 |
| character_set_filesystem | binary |
| character_set_results | utf8mb4 |
| character_set_server | utf8mb4 |
| character_set_system | utf8 |
| collation_connection | utf8mb4_unicode_ci |
| collation_database | utf8mb4_unicode_ci |
| collation_server | utf8mb4_unicode_ci |
+--------------------------+--------------------+
10 rows in set (0.008 sec)

Explanation of character_set_filesystem and character_set_system

From the character set and sorting rules shown above, you can see that two items are not utf8mb4.

  • character_set_filesystem should be binary unless you are using a file system that supports multi-byte UTF-8 encoded characters in file names.
  • character_set_system is always utf8 and cannot be overridden.

Other references about character set settings

The above settings are already the correct ones to modify utf8mb4. Several possible related settings are listed below for possible reference (although they are rarely used):

[client]
default-character-set = utf8mb4

[mysql]
default-character-set = utf8mb4

[mysqld]
explicit_defaults_for_timestamp = 1 # Posting it here as a tip to disable the Timestamp message, maybe it can help someone :)
character-set-client-handshake = 0 # FALSE
init_connect = 'SET character_set_system = utf8mb4'
init_connect = 'SET character_set_connection = utf8mb4'
init_connect = 'SET character_set_database = utf8mb4'
init_connect = 'SET character_set_results = utf8mb4'
init_connect = 'SET collation_database = utf8mb4_unicode_ci'
init_connect = 'SET collation_connection = utf8mb4_unicode_ci'
init_connect = 'SET NAMES utf8mb4'
character-set-server = utf8mb4
#character_set_client = utf8mb4
collation-server = utf8mb4_unicode_ci
collation_connection = utf8mb4_unicode_ci
collation_database = utf8mb4_unicode_ci

Multiple init_connect can also be set like this: init_connect = 'SET collation_connection = utf8mb4_unicode_ci,NAMES utf8mb4'. Unverified

The complete process of switching the character set of an existing database to utf8mb4

If the database is created with utf8mb4, there will be no such troubles! ! !

Step 1: Create a backup Create a backup of all databases on the server to be upgraded. Safety first!

Step 2: Upgrade MySQL server Upgrade MySQL server to v5.5.3+. When MySQL is upgraded to 8.0, there is no need to modify the MySQL server character set in step 5. The default is utf8mb4. MariaDB also needs to modify the character set.

Step 3: Modify the database, tables, and columns Change the character set and collation properties of the database, tables, and columns to use utf8mb4.

# For each database:
ALTER DATABASE database_name CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;
# For each table:
ALTER TABLE table_name CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
# For each column:
ALTER TABLE table_name CHANGE column_name column_name VARCHAR(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
# (Don't copy and paste blindly! The exact statement depends on the column type, maximum length, and other properties. The above line is just an example for a `VARCHAR` column.)
# Or modefy statement ALTER TABLE table_name MODIFY column_name VARCHAR(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;

Step 4: Check the maximum length of column and index keys

This is probably the most tedious part of the entire upgrade process.

When converting from utf8 to utf8mb4, the maximum length of a column or index key in bytes is unchanged. Therefore, it is smaller in terms of characters, since the maximum length of a character is now four bytes instead of three.

For example, a TINYTEXT column can hold up to 255 bytes, which correlates to 85 three-byte characters or 63 four-byte characters. Suppose you have a TINYTEXT column that uses utf8 but must be able to contain more than 63 characters. Given this requirement, you can't convert this column to utf8mb4 unless you also change the data type to a longer type, such as TEXT — because if you try to fill it with four-byte characters, you'll only be able to enter 63 characters, but no more.
The same is true for index keys. The maximum index length for the InnoDB storage engine is 767 bytes, so for utf8 or utf8mb4 columns, you can index up to 255 or 191 characters, respectively. If you currently have indexes on utf8 columns that are longer than 191 characters, you will need to index a smaller number of characters when using utf8mb4. (As a result, I had to change some indexed VARCHAR(255) columns to VARCHAR(191).)

Step 5: Modify the connection, client, and server character sets

In your application code, set the connection character set to utf8mb4. This can be done by simply replacing SET NAMES utf8 with SET NAMES utf8mb4. At the same time, the collation rules must also be modified accordingly, for example, SET NAMES utf8 COLLATE utf8_unicode_ci becomes SET NAMES utf8mb4 COLLATE utf8mb4_unicode_ci.

Make sure to set both the client and server character sets.

The MySQL configuration file (/etc/my.cnf) contains the following:

[client]
default-character-set = utf8mb4

[mysql]
default-character-set = utf8mb4

[mysqld]
character-set-client-handshake = FALSE
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci

Step 6: Repair and optimize all tables
After upgrading your MySQL server and making the necessary changes above, make sure to repair and optimize all databases and tables. Otherwise you might get strange errors even though no error is thrown.
You can run the following MySQL query for each table you want to repair and optimize:

# For each table
REPAIR TABLE table_name;
OPTIMIZE TABLE table_name;

This can be easily done in one go using the command-line mysqlcheck utility:

$ mysqlcheck -u root -p --auto-repair --optimize --all-databases

This will prompt for the root user's password, after which all tables in all databases will be repaired and optimized.

refer to

The main reference is translated from: How to support full Unicode in MySQL databases, welcome to read the original text.

This is the end of this article about how to support full Unicode in MySQL/MariaDB. For more information about MySQL/MariaDB supporting Unicode, please search for previous articles on 123WORDPRESS.COM or continue to browse the following related articles. I hope everyone will support 123WORDPRESS.COM in the future!

You may also be interested in:
  • Solution to remote connection to MySQL error code 1130
  • mysql error 1130 hy000: Host ''localhost'' solution
  • Solution to MySQL connection exception and error 10061
  • Graphic tutorial on installing MySQL database and configuring environment variables on Mac
  • Design and implementation of a student club management system based on JavaSwing+MySQL
  • MySQL multi-instance deployment and installation guide under Linux
  • Detailed process of deploying MySQL with docker (common applications deployed with docker)
  • MySQL 1130 exception, unable to log in remotely solution

<<:  Several ways to implement image adaptive container with CSS (summary)

>>:  Solution to the blank page after vue.js packaged project

Recommend

React+axios implements github search user function (sample code)

load Request Success Request failed Click cmd and...

Vue uniapp realizes the segmenter effect

This article shares the specific code of vue unia...

Multiple ways to insert SVG into HTML pages

SVG (Scalable Vector Graphics) is an image format...

Combining XML and CSS styles

student.xml <?xml version="1.0" enco...

Development details of Vue3 components

Table of contents 1. Introduction 2. Component De...

Docker image optimization (from 1.16GB to 22.4MB)

Table of contents The first step of optimization:...

30 Tips for Writing HTML Code

1. Always close HTML tags In the source code of p...

Detailed tutorial on running Tomcat in debug mode in IDEA Maven project

1. Add the following dependencies in pom.xml <...

JavaScript data flattening detailed explanation

Table of contents What is Flattening recursion to...

Use of Linux read command

1. Command Introduction The read command is a bui...

MySQL high concurrency method to generate unique order number

Preface After this blog post was published, some ...

A simple way to put HTML footer at the bottom of the page

Requirement: Sometimes, when the page content is ...

Research on the value of position attribute in CSS (summary)

The CSS position attribute specifies the element&...