Detailed explanation of character sets and validation rules in MySQL

Detailed explanation of character sets and validation rules in MySQL

1Several common character sets

In MySQL, the most common character sets include ASCII character set, latin character set, GB2312 character set, GBK character set, UTF8 character set, etc. Let us briefly introduce these character sets:

ASCII character set

This character set uses 1 byte for encoding. One byte has 8 bits and can store a total of 128 characters. The specific correspondence is as follows:

Latin character set

The Latin character set can store a total of 256 characters. Compared with the ASCII code, it contains 128 commonly used characters in Western Europe.

GB2312 character set

It includes Chinese characters, Latin characters, Greek characters, etc., among which Chinese characters account for the majority, with 6763, and other text symbols are 638, and it is compatible with ASCII characters. When encoding ASCII, it uses 1 byte, which is 128 bits, and when encoding other characters, it uses 2 bytes. It can be understood that it is a side length encoding method.

GBK character set

This character set is an extension of the GB2312 character set. It is compatible with the GB2312 character set and is generally encoded using two bytes.

UTF8 character set

It is usually encoded by 1 to 4 bytes. According to the different bytes used, it can also be divided into UTF8 and utf8mb4. mb4 means up to 4 bytes. Generally speaking, UTF8 uses three bytes for encoding. In addition, there are utf16 and utf32. Utf16 uses 2 or 4 bytes to encode a character, and utf32 uses 4 bytes to encode a character.

It should be noted that some emoji expressions need to be represented using utf8mb4.

2 Character sets supported in MySQL

MySQL supports many character sets. Taking my local character set as an example, I use the show charset command to view the character sets supported by the current server. The results are as follows:

[email protected]:(none) 22:46:48>>show charset;
+----------+-----------------------------+---------------------+--------+
| Charset | Description | Default collation | Maxlen |
+----------+-----------------------------+---------------------+--------+
| big5 | Big5 Traditional Chinese | big5_chinese_ci | 2 |
| dec8 | DEC West European | dec8_swedish_ci | 1 |
| cp850 | DOS West European | cp850_general_ci | 1 |
| hp8 | HP West European | hp8_english_ci | 1 |
| koi8r | KOI8-R Relcom Russian | koi8r_general_ci | 1 |
| latin1 | cp1252 West European | latin1_swedish_ci | 1 |
| latin2 | ISO 8859-2 Central European | latin2_general_ci | 1 |
| swe7 | 7bit Swedish | swe7_swedish_ci | 1 |
| ascii | US ASCII | ascii_general_ci | 1 |
| ujis | EUC-JP Japanese | ujis_japanese_ci | 3 |
| sjis | Shift-JIS Japanese | sjis_japanese_ci | 2 |
| hebrew | ISO 8859-8 Hebrew | hebrew_general_ci | 1 |
| tis620 | TIS620 Thai | tis620_thai_ci | 1 |
| euckr | EUC-KR Korean | euckr_korean_ci | 2 |
| koi8u | KOI8-U Ukrainian | koi8u_general_ci | 1 |
| gb2312 | GB2312 Simplified Chinese | gb2312_chinese_ci | 2 |
| greek | ISO 8859-7 Greek | greek_general_ci | 1 |
| cp1250 | Windows Central European | cp1250_general_ci | 1 |
| gbk | GBK Simplified Chinese | gbk_chinese_ci | 2 |
| latin5 | ISO 8859-9 Turkish | latin5_turkish_ci | 1 |
| armscii8 | ARMSCII-8 Armenian | armscii8_general_ci | 1 |
| utf8 | UTF-8 Unicode | utf8_general_ci | 3 |
| ucs2 | UCS-2 Unicode | ucs2_general_ci | 2 |
| cp866 | DOS Russian | cp866_general_ci | 1 |
| keybcs2 | DOS Kamenicky Czech-Slovak | keybcs2_general_ci | 1 |
| macce | Mac Central European | macce_general_ci | 1 |
| macroman | Mac West European | macroman_general_ci | 1 |
| cp852 | DOS Central European | cp852_general_ci | 1 |
| latin7 | ISO 8859-13 Baltic | latin7_general_ci | 1 |
| utf8mb4 | UTF-8 Unicode | utf8mb4_general_ci | 4 |
| cp1251 | Windows Cyrillic | cp1251_general_ci | 1 |
| utf16 | UTF-16 Unicode | utf16_general_ci | 4 |
| cp1256 | Windows Arabic | cp1256_general_ci | 1 |
| cp1257 | Windows Baltic | cp1257_general_ci | 1 |
| utf32 | UTF-32 Unicode | utf32_general_ci | 4 |
| binary | Binary pseudo charset | binary | 1 |
| geostd8 | GEOSTD8 Georgian | geostd8_general_ci | 1 |
| cp932 | SJIS for Windows Japanese | cp932_japanese_ci | 2 |
| eucjpms | UJIS for Windows Japanese | eucjpms_japanese_ci | 3 |
+----------+-----------------------------+---------------------+--------+
39 rows in set (0.00 sec)

As you can see, a total of 39 character sets are supported. Of course, this number varies in different versions. Among these 39 character sets, the main ones we need to remember are the following:

+----------+-----------------------------+---------------------+--------+
| Charset | Description | Default collation | Maxlen |
+----------+-----------------------------+---------------------+--------+
| latin1 | cp1252 West European | latin1_swedish_ci | 1 |
| ascii | US ASCII | ascii_general_ci | 1 |
| gb2312 | GB2312 Simplified Chinese | gb2312_chinese_ci | 2 |
| gbk | GBK Simplified Chinese | gbk_chinese_ci | 2 |
| utf8 | UTF-8 Unicode | utf8_general_ci | 3 |
| utf8mb4 | UTF-8 Unicode | utf8mb4_general_ci | 4 |
+----------+-----------------------------+---------------------+--------+

Here, we only need to know how many bytes their maximum length contains.

3 Verification rules

The validation rule of a character set refers to the criterion used when comparing the size of a character set. For example, if we compare the size of a and B, if we do not consider the case, then a<B. If we consider the case, then a>B. In other words, the same character set but different comparison rules will produce different sorting results for a column of data.

The character validation rules in MySQL can be viewed using the show collation; syntax, as follows:

[email protected]:(none) 23:00:36>>show collation;
+-----------------------+----------+-----+---------+----------+---------+
| Collation | Charset | Id | Default | Compiled | Sortlen |
+-----------------------+----------+-----+---------+----------+---------+
| big5_chinese_ci | big5 | 1 | Yes | Yes | 1 |
| big5_bin | big5 | 84 | | Yes | 1 |
| dec8_swedish_ci | dec8 | 3 | Yes | Yes | 1 |
| dec8_bin | dec8 | 69 | | Yes | 1 |
...........
| koi8r_general_ci | koi8r | 7 | Yes | Yes | 1 |
| koi8r_bin | koi8r | 74 | | Yes | 1 |
| latin1_german1_ci | latin1 | 5 | | Yes | 1 |
| koi8u_general_ci | koi8u | 22 | Yes | Yes | 1 |
| utf8_general_ci | utf8 | 33 | Yes | Yes | 1 |
| utf8_bin | utf8 | 83 | | Yes | 1 |
| utf8_unicode_ci | utf8 | 192 | | Yes | 8 |
| utf8_icelandic_ci | utf8 | 193 | | Yes | 8 |
| utf8_latvian_ci | utf8 | 194 | | Yes | 8 |
| utf8_romanian_ci | utf8 | 195 | | Yes | 8 |
| utf8_slovenian_ci | utf8 | 196 | | Yes | 8 |
| utf8_polish_ci | utf8 | 197 | | Yes | 8 |
| utf8_estonian_ci | utf8 | 198 | | Yes | 8 |
+-----------------------+----------+-----+---------+----------+---------+
195 rows in set (0.00 sec)

We can see that there are a total of 195 comparison rules in the results. Each character set contains its own default validation rule. Let us briefly explain one: utf8_polish_ci compares based on Polish. This validation rule consists of three parts. The comparison rule name starts with the name of the character set associated with it. utf8 refers to the comparison rule of the utf8 character set, polish refers to Polish, and _ci refers to case-insensitive.

For the last suffix, we can summarize as follows:

_ai is accent-insensitive

_as is accent sensitive

_ci case insensitive (insensitive) not case sensitive

_cs case sensitive (sensitive) case sensitive

_bin binary

In the above results, we can also see that the value of the default column of some validation rules is yes, which means that this validation rule is the default validation rule for the character set.

4 Character sets and comparison rules on the server

MySQL provides two system variables to represent the server-level character set and comparison rules, one is character_set_sever and the other is collation_server. We can view their default values:

[email protected]:(none) 23:12:28>>show variables like 'character_set_server';
+----------------------+-------+
| Variable_name | Value |
+----------------------+-------+
| character_set_server | utf8 |
+----------------------+-------+
1 row in set (0.00 sec)

[email protected]:(none) 23:12:47>>show variables like 'collation_server';
+------------------+-----------------+
| Variable_name | Value |
+------------------+-----------------+
| collation_server | utf8_general_ci |
+------------------+-----------------+
1 row in set (0.00 sec)

[email protected]:(none) 23:12:57>>

In MySQL, character sets and validation rules are divided into four levels: server level, database level, table level, and field level. Here we take examples to see them respectively:

Database levelmysql 23:23:48>>create database yyz character set utf8 collate utf8_general_ci;
Query OK, 1 row affected (0.01 sec)

Table levelmysql 23:24:23>>create table yyz.yyz (name varchar(10)) charset gb2312 collate gb2312_chinese_ci;
Query OK, 0 rows affected (0.04 sec)

Field levelmysql 23:27:06>>
[email protected]:(none) 23:27:06>>alter table yyz.yyz add num varchar(10) charset gbk collate gbk_chinese_ci;
Query OK, 0 rows affected (0.11 sec)
Records: 0 Duplicates: 0 Warnings: 0

View the resultsmysql 23:28:27>>show create table yyz.yyz;
+-------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table |
+-------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| yyz | CREATE TABLE `yyz` (
 `name` varchar(10) DEFAULT NULL,
 `num` varchar(10) CHARACTER SET gbk DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=gb2312 |
+-------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

Since the character set and comparison rules are related to each other, if we only modify the character set, the comparison rules will also change. If we only modify the comparison rules, the character set will also change. The specific rules are as follows:

If you only modify the character set, the comparison rules will become the default comparison rules of the modified character set.
If you only modify the comparison rule, the character set will become the character set corresponding to the modified comparison rule.

There are 4 more things to remember:

  • For the database, if there is no fatal character set and comparison rules in the statements for creating and modifying the database, the server-level character set and comparison rules will be used as the database's character set and comparison rules.
  • For a table, if the character set and comparison rules are not specified in the statements for creating or modifying the table, the character set and comparison rules of the database where the table is located will be used as the character set and comparison rules of the table;
  • For a column, if the character set and comparison rules are not specified in the creation and modification statements, the character set and comparison rules of the table where the column is located will be used as the character set and comparison rules of the column.
  • For columns that store strings, different columns in the same table can have different character sets and comparison rules.

The above is a detailed explanation of the character set and validation rules in MySQL. For more information about the character set and validation rules in MySQL, please pay attention to other related articles on 123WORDPRESS.COM!

You may also be interested in:
  • MySQL character set viewing and modification tutorial
  • How to modify the MySQL character set
  • Causes and solutions to the garbled character set problem in MySQL database
  • How to change MySQL character set utf8 to utf8mb4
  • How to unify the character set on an existing mysql database
  • MySQL character set garbled characters and solutions
  • Detailed explanation of JDBC's processing of Mysql utf8mb4 character set
  • Solution to index failure in MySQL due to different field character sets
  • How to change the default character set of MySQL to utf8 on MAC
  • How to set the character set for mysql under Docker
  • How to solve the problem of MySQL query character set mismatch

<<:  In-depth explanation of the various methods and advantages and disadvantages of JavaScript inheritance

>>:  What is dns-prefetch? Front-end optimization: DNS pre-resolution improves page speed

Recommend

Basic usage of wget command under Linux

Table of contents Preface 1. Download a single fi...

JavaScript anti-shake case study

principle The principle of anti-shake is: you can...

MySQL 8.0.18 installation and configuration method graphic tutorial under MacOS

This article records the installation of MySQL 8....

HTML end tag issue and w3c standard

According to the principles of W3C, each start tag...

Use pure CSS to disable the a tag in HTML without JavaScript

In fact, this problem has already popped up when I...

VMware virtual machine three connection methods example analysis

NAT In this way, the virtual machine's networ...

Introduction to JavaScript strict mode use strict

Table of contents 1. Overview 1.1 What is strict ...

About the overlap of margin value and vertical margin in CSS

Margin of parallel boxes (overlap of double margi...

Vue realizes price calendar effect

This article example shares the specific code of ...

A brief discussion on the magical uses of CSS pseudo-elements and pseudo-classes

CSS plays a very important role in a web page. Wi...

The leftmost matching principle of MySQL database index

Table of contents 1. Joint index description 2. C...