I recently encountered a bug where I was trying to save a UTF-8 string in MariaDB encoded in "utf8" via Rails and I got a weird error:
I use UTF-8 encoding on the client, the server, and the database. Even the string " The crux of the problem is that MySQL's "utf8" is not actually true UTF-8. "utf8" only supports up to three bytes per character, while true UTF-8 supports up to four bytes per character. MySQL never fixed this bug, and in 2010 they released a character set called "utf8mb4" that circumvented this problem. Of course, they did not advertise the new character set widely (probably because this bug made them feel embarrassed), so that developers are still advised to use "utf8" on the Internet, but these suggestions are wrong. A brief summary is as follows: MySQL's "utf8mb4" is true "UTF-8". MySQL's "utf8" is a "proprietary encoding" that can encode only a few Unicode characters. I want to make it clear here: all MySQL and MariaDB users who are using "utf8" should switch to "utf8mb4" and never use "utf8" again. So what is encoding? What is UTF-8? We all know that computers use 0s and 1s to store text. For example, if the character "C" is stored as "01000011", the computer needs to go through two steps to display this character:
Similarly:
Almost all web applications use the Unicode character set because there is no reason to use anything else. The Unicode character set contains millions of characters. The simplest encoding is UTF-32, which uses 32 bits per character. This is easiest to do because computers have always viewed 32 bits as numbers, and computers are great at crunching numbers. But the problem is that this wastes too much space. UTF-8 can save space. In UTF-8, the character "C" only needs 8 bits, and some uncommon characters, such as "", need 32 bits. Other characters may use 16 or 24 bits. An article like this one, if encoded using UTF-8, takes up only about a quarter of the space of UTF-32. MySQL's "utf8" character set is incompatible with other programs. What it calls "utf8" may really be a bunch of... A brief history of MySQLWhy would MySQL developers disable "utf8"? We may be able to find the answer from the commit log. MySQL has supported UTF-8 since version 4.1, which was in 2003, and the UTF-8 standard (RFC 3629) used today came later. The old UTF-8 standard (RFC 2279) supports up to 6 bytes per character. On March 28, 2002, MySQL developers used RFC 2279 in the first preview release of MySQL 4.1. In September of the same year, they made an adjustment to the MySQL source code: "UTF8 now only supports sequences of up to 3 bytes." Who submitted this code? Why did he do that? The answer to this question is unknown. After migrating to Git (MySQL originally used BitKeeper), many of the names of committers in the MySQL codebase were lost. There are no clues in the mailing list from September 2003 that could explain this change. But I can try to guess. In 2002, MySQL made a decision: if users can ensure that each row of the data table uses the same number of bytes, then MySQL can make a big improvement in performance. To do this, the user needs to define the text columns as "CHAR". Each "CHAR" column always has the same number of characters. If the number of characters inserted is less than the defined number, MySQL will fill it with spaces. If the number of characters inserted exceeds the defined number, the excess will be truncated. When MySQL developers first experimented with UTF-8, they used 6 bytes per character, CHAR(1) used 6 bytes, CHAR(2) used 12 bytes, and so on. It should be said that their initial behavior was correct, but unfortunately this version has never been released. But it is written in the document and it is widely circulated. Everyone who understands UTF-8 agrees with what is written in the document. But it is clear that MySQL developers or manufacturers are worried that users will do these two things: Define the column using CHAR (nowadays, CHAR is an old hat, but back then, it was faster in MySQL, but that has not been the case since 2005). So the result is no winner. Users who expect a win-win in space and speed may find that when they use CHAR columns with "utf8" they actually use more space than expected and are slower than expected. And users who want correctness, when they use "utf8" encoding, cannot save characters like "". After this illegal character set was released, MySQL could not fix it without asking all users to rebuild their databases. Finally, MySQL was re-released in 2010 as "utf8mb4" to support true UTF-8. Why does this matter make people so crazy? Because of this problem, I was crazy for a whole week. I was fooled by "utf8" and spent a lot of time to find the bug. But I'm definitely not the only one, almost all articles on the web treat "utf8" as real UTF-8. "utf8" is just a proprietary character set, which brings us new problems that have not been solved. SummarizeIf you are using MySQL or MariaDB, do not use the "utf8" encoding, use "utf8mb4" instead. A guide is provided here for converting an existing database character encoding from "utf8" to "utf8mb4". **Original English text:**https://medium.com/@adamhooper/in-mysql-never-use-utf8-use-utf8mb4-11761243e434 You may also be interested in:
|
<<: HTML tutorial, understanding the optgroup element
>>: Docker and Docker-compose one-click installation tutorial (supports online and offline)
Six effectsImplementation Code html <h1>CSS...
This article shares the detailed steps of install...
Forgot your MySQL password twice? At first I did ...
I have been in contact with PHP for so long, but ...
Table of contents 1. Download 2. Deployment 1.Ngi...
Now if you want to use the video tag in a page, y...
Make a nice flip login and registration interface...
Box-sizing in CSS3 (content-box and border-box) T...
Table of contents 1. Write Webshell into outfile ...
The following command is often used: chmod 777 文件...
This article uses examples to illustrate the opti...
Table of contents 1. System Information 2. Shutdo...
Table of contents What is JSI What is different a...
Table of contents 1. Common function classificati...
1. Introduction The location instruction is the c...