The difference between GB2312, GBK and UTF-8 in web page encoding


First of all, GB2312, GBK and UTF-8 are all character encodings. There are many others, but these three are the ones most commonly used on Chinese websites.

Why do we need encodings at all? A computer stores text as numbers, so every character must correspond to a unique numeric code. Computers were first developed in the United States, where the letters and symbols on the keyboard fit comfortably into ASCII, a 7-bit code with 128 positions. Chinese is different: ASCII has no room for thousands of Chinese characters, yet each of them also needs a unique code. This is how the character encoding standards formulated by the state came into being: GB2312, GBK and so on. Other countries and other languages have their own encoding standards as well. GB stands for guobiao, meaning national standard.

GB2312 and GBK are mainly used for encoding Chinese characters, while UTF-8 is used worldwide. If your web pages are mainly for Chinese readers, GB2312 and GBK work well, and they have the advantage that Chinese text takes less storage (two bytes per Chinese character rather than the three that UTF-8 typically needs). If your web page is meant to be viewed around the world and you use GB2312 or GBK as the page encoding, a computer or browser without support for that encoding will show the Chinese characters on your page as unrecognizable garbled text.

The encoding is usually declared in a meta tag of the web page, for example <meta http-equiv="Content-Type" content="text/html; charset=gb2312">, indicating that this page uses GB2312 encoding. This information is for the browser: the browser gives priority to the encoding declared in the head of the web page when decoding it. Of course, we can also force the browser to interpret the page with a different encoding, and then we get to see the legendary garbled text.
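To make the role of the meta declaration concrete, here is a minimal Python sketch of reading the declared charset from raw page bytes and decoding with it. The helper names sniff_meta_charset and decode_page are hypothetical, the meta tag is an assumed typical form, and the regular expression is a deliberately simplified stand-in for what real browsers do.

```python
import re

# Hypothetical helper: a simplified stand-in for browser charset sniffing.
# Real browsers follow a much more elaborate algorithm.
META_CHARSET = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE
)

def sniff_meta_charset(raw: bytes, default: str = "utf-8") -> str:
    """Return the charset named in the first <meta> declaration, or the default."""
    match = META_CHARSET.search(raw[:1024])  # declarations sit near the top
    return match.group(1).decode("ascii").lower() if match else default

def decode_page(raw: bytes) -> str:
    """Decode page bytes with the declared charset, replacing undecodable bytes."""
    return raw.decode(sniff_meta_charset(raw), errors="replace")

# Build a GB2312-encoded page. The tag itself is pure ASCII, so it can be
# read before the page encoding is known.
page = ('<html><head><meta http-equiv="Content-Type" '
        'content="text/html; charset=gb2312"></head>'
        '<body>你好</body></html>').encode("gb2312")

print(sniff_meta_charset(page))        # gb2312
print("你好" in decode_page(page))      # True
```

The declaration can be found before decoding because GB2312, GBK and UTF-8 all keep ASCII bytes unchanged.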

GBK, GB2312 and similar encodings cannot be converted to UTF-8 directly. In both directions the conversion goes through Unicode:

GBK, GB2312 -> Unicode -> UTF-8
UTF-8 -> Unicode -> GBK, GB2312
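The two-step conversion can be sketched in Python, where a str is a sequence of Unicode code points and therefore plays the role of the intermediate Unicode form. The function names are illustrative only.

```python
def gbk_to_utf8(data: bytes) -> bytes:
    """Convert GBK-encoded bytes to UTF-8 bytes via Unicode."""
    text = data.decode("gbk")      # GBK bytes -> Unicode string
    return text.encode("utf-8")    # Unicode string -> UTF-8 bytes

def utf8_to_gbk(data: bytes) -> bytes:
    """Convert UTF-8 bytes to GBK bytes via Unicode."""
    text = data.decode("utf-8")    # UTF-8 bytes -> Unicode string
    return text.encode("gbk")      # Unicode string -> GBK bytes

gbk_bytes = "中文".encode("gbk")
utf8_bytes = gbk_to_utf8(gbk_bytes)

print(gbk_bytes.hex())    # d6d0cec4
print(utf8_bytes.hex())   # e4b8ade69687
print(utf8_to_gbk(utf8_bytes) == gbk_bytes)  # True
```

The same two characters occupy four bytes in GBK and six bytes in UTF-8, and the byte values have nothing in common, which is why no direct byte-for-byte mapping exists.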

For a website or forum with a lot of English text, UTF-8 is recommended: English letters take one byte each in UTF-8, so no space is wasted. However, many forum plug-ins currently support only GBK.
For a Chinese-only website, GB2312 or GBK is often recommended because each Chinese character takes two bytes instead of the three it typically needs in UTF-8, but garbled text still turns up from time to time. To avoid garbled-text problems altogether, use UTF-8. It also makes future internationalization much easier. UTF-8 encodes Unicode, a very large character set that covers the text of most languages.
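The storage trade-off is easy to measure. The following Python sketch compares encoded sizes for an English and a Chinese sample string; the helper name encoded_sizes and the sample strings are illustrative.

```python
def encoded_sizes(text: str) -> dict:
    """Return the number of bytes the text occupies in GBK and in UTF-8."""
    return {codec: len(text.encode(codec)) for codec in ("gbk", "utf-8")}

english = "Hello, world"
chinese = "你好，世界"   # five characters, including a full-width comma

print(encoded_sizes(english))  # {'gbk': 12, 'utf-8': 12}
print(encoded_sizes(chinese))  # {'gbk': 10, 'utf-8': 15}
```

English text costs the same in both encodings, while Chinese text is about half again as large in UTF-8 as in GBK.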

One benefit of using UTF-8 is that users in other regions (such as Hong Kong and Taiwan) can view your text normally, without garbled characters and without installing Simplified Chinese support.

GB2312 is the encoding for Simplified Chinese
GBK supports Simplified Chinese and Traditional Chinese
Big5 supports Traditional Chinese
UTF-8 supports almost all characters
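The coverage summarized in this list can be probed with Python's built-in codecs. This is a small sketch using one simplified and one traditional form of the same character as samples; the helper name can_encode is illustrative.

```python
def can_encode(text: str, codec: str) -> bool:
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

# The same word "Han" in its simplified and traditional forms.
samples = {"simplified": "汉", "traditional": "漢"}

for codec in ("gb2312", "gbk", "big5", "utf-8"):
    support = {name: can_encode(ch, codec) for name, ch in samples.items()}
    print(codec, support)
```

With these samples, GB2312 rejects the traditional form while GBK and UTF-8 accept both.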

The most commonly used encoding in mainland China is GB18030. In addition, there are GBK and GB2312. These encodings are related as follows. The earliest Chinese character encoding was GB2312, which included 6,763 Chinese characters and 682 other symbols. It was extended in 1995 under the name GBK 1.0, which included a total of 21,886 symbols. Later, GB18030 was introduced, which included a total of 27,484 Chinese characters as well as major minority scripts such as Tibetan, Mongolian and Uyghur. The Windows platform is now required to support GB18030.

GB2312 contains about 6,000 Chinese characters (excluding special characters). They are encoded with a first byte in the range B0-F7 and a second byte in the range A1-FE (when the first byte is D7, the second byte only runs A1-F9). Counting these positions gives 72 x 94 - 5 = 6,763 Chinese characters. Adding the 682 other symbols gives 7,445 characters, or about 7,573 codes once the 128 single-byte ASCII codes (control characters included) are counted.

GBK is an extension of GB2312 that accommodates more Chinese characters, but it is only an extension, with no qualitative change. All GB2312 codes are retained and the code range is enlarged on that basis. It accommodates a total of 22,014 codes (including special characters and the 128 ASCII codes).

GB18030 is in turn an extension of GBK. Because there are more Chinese characters than two-byte codes can hold, it adopts a mix of two-byte and four-byte codes (alongside single-byte ASCII) to support more characters. It retains the original two-byte GBK codes, so it stays compatible with files encoded in GB2312 and GBK. It is cited as accommodating approximately 55,657 codes (including special characters).

Unicode, commonly known as the universal code, is committed to expressing the text of all countries with one unified standard; UTF-8 is one of its encoding forms. UTF-8 is a variable-length encoding that uses one to four bytes per character: ASCII takes one byte and most Chinese characters take three. Unicode covers every character in GBK and many more, so the range of Chinese characters it accommodates is not smaller than GBK's. Processing Chinese in three bytes does bring a compatibility issue, though: UTF-8 byte sequences differ from those of GB2312, GBK and GB18030, so files in those encodings cannot be read as UTF-8 and must be converted first.
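The byte ranges and counts can be checked with a few lines of Python. This sketch assumes the standard GB 2312 layout, in which the row with first byte D7 stops at second byte F9 and leaves the last five positions unassigned. The function name is_gb2312_hanzi is illustrative. The final loop compares how many bytes GB18030 and UTF-8 need for an ASCII letter, a common Chinese character, and a rare character outside GBK.

```python
def is_gb2312_hanzi(lead: int, trail: int) -> bool:
    """Return True if the byte pair falls in the GB2312 Chinese character area."""
    if not (0xB0 <= lead <= 0xF7 and 0xA1 <= trail <= 0xFE):
        return False
    # Row 0xD7 ends at 0xF9; its last five positions are unassigned.
    return not (lead == 0xD7 and trail > 0xF9)

# Count every byte pair accepted by the range check.
count = sum(
    is_gb2312_hanzi(lead, trail)
    for lead in range(0x100)
    for trail in range(0x100)
)
print(count)  # 6763

# The first Chinese character in GB2312 sits at the start of the range.
lead, trail = "啊".encode("gb2312")
print(hex(lead), hex(trail), is_gb2312_hanzi(lead, trail))  # 0xb0 0xa1 True

# Bytes per character: (GB18030, UTF-8).
for ch in ("A", "中", "𠀀"):
    print(ch, len(ch.encode("gb18030")), len(ch.encode("utf-8")))
# A 1 1
# 中 2 3
# 𠀀 4 4
```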

What are the differences between GBK and GB2312

First of all, what is GBK, and what is GB2312? Both are character encodings, and of course there are many other character encodings as well.

Character encoding can be understood as follows.

Computers store binary values, 0 and 1. Eight bits make up one byte, which is usually written in hexadecimal.

So how do we get the computer to display the characters we want instead of strings of 0s and 1s? The computer has to convert the values it stores into the corresponding characters, whether English, Chinese or any other language, and then output them to the screen.

A character encoding is therefore a set of rules that specifies which stored value corresponds to which character displayed on the screen.
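A short Python sketch shows what "a set of rules" means in practice: the same stored bytes turn into different characters depending on which rules are used to read them. The sample string is arbitrary.

```python
# The bytes actually stored on disk or sent over the network.
raw = "中文".encode("gbk")
print(raw.hex())  # d6d0cec4

# Reading the bytes with the rules they were written with recovers the text.
right = raw.decode("gbk")
print(right)  # 中文

# Reading the same bytes with different rules produces garbled output.
# errors="replace" substitutes U+FFFD for byte sequences invalid in UTF-8.
wrong = raw.decode("utf-8", errors="replace")
print(wrong)
```

This is the garbled text a reader sees when a page is decoded with the wrong encoding.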

To sum up, GBK and GB2312 are both character encodings.

Let's talk about their differences and similarities in detail below:

Similarities:

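1. GBK and GB2312 are both double-byte encodings: each Chinese character takes two bytes (16 bits), while ASCII characters still take one byte.

2. Both are usually declared in the meta tags of web pages.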

Differences:

1. GBK character encoding supports Simplified Chinese and Traditional Chinese!

GBK stands for "Chinese Internal Code Extension Specification" (GB comes from the pinyin of "national standard", guobiao, and K from the pinyin of "extension", kuozhan; its English name is Chinese Internal Code Specification). It was formulated by the National Technical Committee of Information Technology Standardization of the People's Republic of China on December 1, 1995. On December 15, 1995, the Standardization Department of the State Administration of Technical Supervision and the Science and Technology and Quality Supervision Department of the Ministry of Electronics Industry jointly designated it a technical specification guiding document in Technical Supervision Letter 1995 No. 229.

2. GB2312 only supports Simplified Chinese!

"Chinese Character Coded Character Set for Information Interchange" is a set of national standards issued by the General Administration of Standards of China in 1980 and implemented on May 1, 1981. The standard number is GB 2312-1980.
The GB 2312 standard includes a total of 6,763 Chinese characters (3,755 first-level characters and 3,008 second-level characters). It also includes 682 full-width characters, among them Latin letters, Greek letters, Japanese Hiragana and Katakana, and Russian Cyrillic letters.
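The non-Chinese part of GB 2312 can be seen by encoding one sample character from each script with Python's gb2312 codec. The sample characters below are arbitrary picks for illustration.

```python
# One sample character per script that GB 2312 covers besides Chinese characters.
samples = {
    "full-width Latin": "Ａ",
    "Greek": "α",
    "Hiragana": "あ",
    "Katakana": "ア",
    "Cyrillic": "Я",
}

# Each of these is stored as a two-byte code, just like a Chinese character.
encoded = {name: ch.encode("gb2312") for name, ch in samples.items()}
for name, data in encoded.items():
    print(name, samples[name], data.hex(), len(data))
```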

If your web pages are mainly for Chinese readers, GB2312 and GBK work well, and they have the advantage that Chinese text takes less storage. If your web page is meant to be viewed around the world and you use GB2312 or GBK as the page encoding, a computer or browser without support for that encoding will show the Chinese characters on your page as unrecognizable garbled text.
