The difference between GB2312, GBK and UTF-8 in web page encoding

First of all, GB2312, GBK and UTF-8 are all character encodings. There are many others, but these three are the ones most commonly used on Chinese websites.

Why do we need encodings at all? Computers originally stored text as ASCII, in which each character corresponds to a unique code. Computers were invented in the United States, and ASCII covers the letters and symbols on an English keyboard, so English text was easy to handle. Chinese is different: ASCII has no room for Chinese characters, yet each Chinese character still needs a unique code of its own. That is how the character encoding standards formulated by the state came into being: GB2312, GBK and so on. Other countries and languages have their own corresponding standards. GB means "national standard".

GB2312 and GBK are mainly used for encoding Chinese characters, while UTF-8 is used worldwide. If your pages are mainly for Chinese-speaking readers, GB2312 or GBK works well, with the advantage that Chinese text takes less storage (two bytes per Chinese character, against the three that UTF-8 typically needs). If your pages are meant to be read around the world, a browser or system without support for GB2312 or GBK will show the Chinese characters on your page as unrecognizable garbled text.

The encoding is usually declared in a meta tag of the page, for example <meta http-equiv="Content-Type" content="text/html; charset=gb2312">, which indicates that the page uses GB2312. This information is for the browser, which uses the encoding declared there to decode the page (an encoding sent in the HTTP Content-Type header, if present, takes precedence). We can also force the browser to interpret the page with a different encoding, and then we get to see the legendary garbled text.
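The garbled text that appears when a page is decoded with the wrong encoding can be reproduced in a few lines. This is only a minimal sketch: the sample string is made up, and Latin-1 stands in for "whatever wrong encoding the browser falls back to".

```python
# Encode a Chinese string as GB2312, then decode the same bytes two ways.
text = "中文网页"                      # hypothetical page content
raw = text.encode("gb2312")           # the bytes a GB2312 page would contain

correct = raw.decode("gb2312")        # decoded with the declared encoding
garbled = raw.decode("latin-1")       # wrong guess: each byte becomes one Latin-1 character

print(correct)                        # the original text
print(garbled)                        # unreadable accented letters and symbols
print(len(raw), len(garbled))         # 8 bytes read as 8 separate characters
```

Nothing is lost in the bytes themselves; only the interpretation is wrong, which is why switching the browser to the right encoding fixes the display.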

GBK, GB2312 and similar encodings cannot be converted to or from UTF-8 directly; the conversion goes through Unicode:

GBK, GB2312 -> Unicode -> UTF-8
UTF-8 -> Unicode -> GBK, GB2312
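In Python the detour through Unicode is explicit, because a str object is Unicode text: bytes are decoded into a str and the str is encoded again. The helper name transcode below is made up for this sketch.

```python
def transcode(data: bytes, src: str, dst: str) -> bytes:
    """Convert bytes from one encoding to another via Unicode (str)."""
    unicode_text = data.decode(src)      # src bytes -> Unicode
    return unicode_text.encode(dst)      # Unicode -> dst bytes


gbk_bytes = "汉字".encode("gbk")
utf8_bytes = transcode(gbk_bytes, "gbk", "utf-8")
round_trip = transcode(utf8_bytes, "utf-8", "gbk")

print(len(gbk_bytes), len(utf8_bytes))   # 4 6
print(round_trip == gbk_bytes)           # True
```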

For a website or forum with a lot of English text, UTF-8 costs no extra space, because ASCII characters take one byte in UTF-8 just as they do in GBK. However, many forum plug-ins currently support only GBK.
For a Chinese website, GB2312 or GBK is often recommended, because Chinese characters take two bytes there against three in UTF-8, but problems still arise from time to time. To avoid garbled text altogether, use UTF-8, which also makes future internationalization convenient. UTF-8 encodes Unicode, a large character set that covers most of the world's writing.
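The storage trade-off is easy to measure. A small sketch with made-up sample strings; encoded_size is a hypothetical helper name.

```python
def encoded_size(text: str, encoding: str) -> int:
    """Number of bytes the text occupies in the given encoding."""
    return len(text.encode(encoding))


english = "Hello, world"     # 12 ASCII characters
chinese = "你好，世界"        # 5 characters, including a full-width comma

for label, sample in (("english", english), ("chinese", chinese)):
    print(label, encoded_size(sample, "gbk"), encoded_size(sample, "utf-8"))
# english 12 12
# chinese 10 15
```

English text costs the same in both encodings, while Chinese text is about half as large again in UTF-8.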

One benefit of using UTF-8 is that users in other regions (such as Hong Kong and Taiwan) can view your text normally, without garbled characters and without installing Simplified Chinese support.

GB2312 supports Simplified Chinese
GBK supports Simplified Chinese and Traditional Chinese
Big5 supports Traditional Chinese
UTF-8 supports almost all characters
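This coverage summary can be checked against Python's built-in codecs. The sketch assumes the codecs named gb2312, gbk and big5 follow the respective standards, and uses one simplified/traditional pair of the same character as a probe; can_encode is a made-up helper name.

```python
def can_encode(text: str, encoding: str) -> bool:
    """True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


SIMPLIFIED = "汉"     # simplified form
TRADITIONAL = "漢"    # traditional form of the same character

for encoding in ("gb2312", "gbk", "big5", "utf-8"):
    print(encoding, can_encode(SIMPLIFIED, encoding), can_encode(TRADITIONAL, encoding))
# gb2312 True False
# gbk True True
# big5 False True
# utf-8 True True
```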

The current national standard in mainland China is GB18030; GBK and GB2312 are also in use. The relationship between these encodings is as follows. The earliest Chinese character encoding was GB2312, which included 6763 Chinese characters and 682 other symbols. It was extended in 1995 under the name GBK 1.0, which included 21,886 symbols in total. Later the GB18030 encoding was introduced, which included 27,484 Chinese characters as well as major minority scripts such as Tibetan, Mongolian and Uyghur. The Windows platform is now required to support GB18030.

GB2312 contains 6763 Chinese characters. Their codes are two bytes long: the first byte ranges from B0 to F7 and the second byte from A1 to FE (when the first byte is D7, the second byte only goes up to F9, because the last five positions of that row are unassigned). That gives 72 × 94 - 5 = 6763 Chinese characters. Together with the 682 other symbols, the standard defines 7445 characters in total.

GBK is an extension of GB2312 that accommodates more Chinese characters, but it is only an extension, without any qualitative change. All GB2312 codes are retained and the code range is expanded on that basis, to 21,886 characters in total (including symbols).

GB18030 is in turn an extension of GBK. Because two bytes can no longer accommodate all the required characters, it adopts a mixed scheme of one-, two- and four-byte codes. It retains the original two-byte GBK codes, so it is compatible with GB2312- and GBK-encoded files, and its four-byte range is large enough to cover every Unicode code point.

Unicode, commonly known as the universal code, is a character set that aims to express the writing of all countries under one unified standard. UTF-8 is one encoding of Unicode; the two are not the same thing. UTF-8 is variable-length, using one to four bytes per character: ASCII characters take one byte and most Chinese characters take three. Unicode covers more Chinese characters than GBK, not fewer. The cost is compatibility: for anything beyond ASCII, the UTF-8 bytes differ from the GB2312, GBK and GB18030 bytes, so files in those encodings cannot be read as UTF-8 directly and must be converted first.
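The size of the GB2312 Chinese character area can be checked by counting. The sketch below assumes the standard GB 2312 layout (first byte B0 to F7, second byte A1 to FE, with row D7 ending at F9) and compares that range rule with what Python's gb2312 codec actually decodes. Both function names are made up for illustration.

```python
def is_gb2312_hanzi(lead: int, trail: int) -> bool:
    """Range rule for the Chinese character area of GB2312."""
    if not (0xB0 <= lead <= 0xF7 and 0xA1 <= trail <= 0xFE):
        return False
    # Row D7 ends at F9; D7FA..D7FE are unassigned.
    return not (lead == 0xD7 and trail > 0xF9)


def decodes_as_gb2312(lead: int, trail: int) -> bool:
    """True if the two bytes are a valid character for the gb2312 codec."""
    try:
        bytes([lead, trail]).decode("gb2312")
        return True
    except UnicodeDecodeError:
        return False


pairs = [(lead, trail) for lead in range(0xB0, 0xF8) for trail in range(0xA1, 0xFF)]
by_range = sum(is_gb2312_hanzi(lead, trail) for lead, trail in pairs)
by_codec = sum(decodes_as_gb2312(lead, trail) for lead, trail in pairs)

print(by_range, by_codec)
```

Both counts should agree with the 6763 Chinese characters stated in the standard.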

What are the differences between GBK and GB2312

First of all, what is GBK, and what is GB2312? Both are character encodings, and there are of course many other character encodings as well.

We can understand character encoding as follows:

Computers store binary values of 0 and 1.

8 bits make up one byte, which is usually written in hexadecimal.

So how do we get the computer to display the characters we want instead of a string of 0s and 1s?

The computer has to convert the hexadecimal values it stores into the corresponding characters, whether English, Chinese or any other language, and then output them to the screen.

So a character encoding is a set of rules that specifies which stored value corresponds to which character displayed on the screen.
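The "value to character" rule can be seen directly by printing the stored bytes in hexadecimal. A minimal sketch; to_hex is a made-up helper name.

```python
def to_hex(char: str, encoding: str) -> str:
    """Hexadecimal form of the bytes that store char in the given encoding."""
    return char.encode(encoding).hex()


print(to_hex("A", "ascii"))     # 41
print(to_hex("中", "gbk"))      # d6d0
print(to_hex("中", "utf-8"))    # e4b8ad

# The rule also works in reverse: the stored value d6 d0 means "中" under GBK.
decoded = bytes.fromhex("d6d0").decode("gbk")
print(decoded)
```

The same character is stored as different values under different encodings, which is why the reader must know which set of rules was used.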

To sum up, GBK and GB2312 are both character encodings.

Let's talk about their differences and similarities in detail below:

Similarities:

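1. GBK and GB2312 are both double-byte encodings: each Chinese character takes two bytes (16 bits), while ASCII characters take one byte.

2. Both are usually declared in the meta tag of a web page.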

Differences:

1. GBK character encoding supports Simplified Chinese and Traditional Chinese!

GBK stands for "Chinese Internal Code Extension Specification" (the letters G, B and K are the pinyin initials of Guo Biao Kuozhan, "national standard extension"; its English name is Chinese Internal Code Specification). It was formulated by the National Technical Committee of Information Technology Standardization of the People's Republic of China on December 1, 1995. On December 15, 1995, the Standardization Department of the State Administration of Technical Supervision and the Science and Technology and Quality Supervision Department of the Ministry of Electronics Industry jointly designated it a guiding technical specification in Technical Supervision Letter [1995] No. 229.

2. GB2312 only supports Simplified Chinese!

"Chinese Character Coded Character Set for Information Interchange" is a set of national standards issued by the General Administration of Standards of China in 1980 and implemented on May 1, 1981. The standard number is GB 2312-1980.
GB 2312 standard includes a total of 6763 Chinese characters, including 3755 first-level Chinese characters and 3008 second-level Chinese characters; at the same time, GB 2312 includes 682 full-width characters including Latin letters, Greek letters, Japanese Hiragana and Katakana letters, and Russian Cyrillic letters.
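The non-Chinese symbols in GB 2312 are stored as two-byte codes as well, which is why they are called full-width. A quick check with Python's gb2312 codec, using one arbitrarily chosen sample character per script:

```python
# One sample character from each symbol group mentioned above.
samples = {
    "full-width Latin": "Ａ",
    "Greek": "α",
    "Hiragana": "あ",
    "Katakana": "ア",
    "Cyrillic": "Я",
}

# Number of bytes each sample occupies in GB2312.
sizes = {name: len(char.encode("gb2312")) for name, char in samples.items()}

for name, size in sizes.items():
    print(name, size)
```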

If your pages are mainly for Chinese-speaking readers, GB2312 or GBK works well: Chinese text takes less storage than in UTF-8. If your pages are meant to be read around the world, a browser or system without support for these encodings will show the Chinese characters on your page as unrecognizable garbled text.
