Introduction to HTML Chinese Character Encoding Standard

An HTML page should declare the character encoding it uses. The traditional way to declare it is:

<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">

In HTML5, you can also use a shorter form:

<meta charset="UTF-8">
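As a rough illustration of how a tool might read these two declarations, the sketch below uses Python's standard html.parser module to extract the declared charset from either form of the tag. The names MetaCharsetParser and declared_charset are hypothetical and chosen only for this example; real browsers follow a more involved encoding-detection procedure.

```python
from html.parser import HTMLParser


class MetaCharsetParser(HTMLParser):
    """Records the first charset declared in a <meta> tag (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attr_map = dict(attrs)
        if attr_map.get("charset"):
            # Short HTML5 form: <meta charset="UTF-8">
            self.charset = attr_map["charset"].strip().lower()
            return
        http_equiv = (attr_map.get("http-equiv") or "").lower()
        content = (attr_map.get("content") or "").lower()
        if http_equiv == "content-type" and "charset=" in content:
            # Long form: content="text/html;charset=UTF-8"
            value = content.split("charset=", 1)[1]
            self.charset = value.split(";", 1)[0].strip()


def declared_charset(html_text):
    """Return the declared charset in lowercase, or None if there is none."""
    parser = MetaCharsetParser()
    parser.feed(html_text)
    return parser.charset


long_form = '<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">'
short_form = '<meta charset="UTF-8">'
print(declared_charset(long_form))   # utf-8
print(declared_charset(short_form))  # utf-8
```

Both forms yield the same value, which is why the shorter one is usually sufficient in HTML5 documents.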

Because many languages and scripts are in use around the world, the Unicode Consortium developed the Unicode standard to support cross-language, cross-platform text exchange and processing. Its first version was published in 1991, and it has been revised continuously since. Unicode provides 1,114,112 code points (U+0000 to U+10FFFF) and aims to cover the characters of all writing systems, including ancient scripts. However, representing every code point with a fixed width takes 32 bits (that is, 4 bytes) per character, as in UTF-32, which consumes a relatively large amount of storage. Even commonly used characters such as ASCII then need 4 bytes each, so storage efficiency is relatively low.
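The storage cost described here can be checked with a short Python sketch. It relies only on the standard codecs; the helper name encoded_sizes is made up for this illustration, and the "utf-32-be" codec is used so that no byte order mark is counted.

```python
# The Unicode code space runs from U+0000 through U+10FFFF.
TOTAL_CODE_POINTS = 0x10FFFF + 1
print(TOTAL_CODE_POINTS)  # 1114112


def encoded_sizes(text):
    """Return the byte length of text in UTF-32 and in UTF-8 (no byte order mark)."""
    return {
        "utf-32": len(text.encode("utf-32-be")),
        "utf-8": len(text.encode("utf-8")),
    }


print(encoded_sizes("A"))      # {'utf-32': 4, 'utf-8': 1}
print(encoded_sizes("中"))     # {'utf-32': 4, 'utf-8': 3}
print(encoded_sizes("hello"))  # {'utf-32': 20, 'utf-8': 5}
```

For plain ASCII text, the fixed-width form is four times larger than the one-byte-per-character form.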

To address this, UTF-8 was defined as a variable-width encoding format that uses 8-bit code units. In UTF-8, commonly used characters take fewer bytes and less common characters take more, which improves storage efficiency. For example, ASCII characters are still represented by a single byte. This is achieved by marking the high bits of each byte, which keeps UTF-8 backward compatible with ASCII. The encoding rules are:
0000~007F: 0xxxxxxx, stored as 1 byte, with 7 bits available for the code point; these correspond to the ASCII characters
0080~07FF: 110xxxxx, 10xxxxxx, stored as 2 bytes, with 11 bits available for the code point
0800~FFFF: 1110xxxx, 10xxxxxx, 10xxxxxx, stored as 3 bytes, with 16 bits available for the code point
10000~10FFFF: 11110xxx, 10xxxxxx, 10xxxxxx, 10xxxxxx, stored as 4 bytes, with 21 bits available for the code point (Unicode ends at 10FFFF, so not every 21-bit value is used)

The pattern is as follows: if the highest bit of the first byte is not 0, the number of 1 bits before the first 0 gives the number of bytes in the sequence, and every byte after the first carries a 10 prefix. Unicode also has other encoding formats, such as UTF-16 and UTF-32, but UTF-8 is more commonly used and can also represent the entire character set.
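The bit patterns listed above can be sketched as a small hand-written encoder. This is a minimal illustration, not a replacement for Python's built-in codec: the function names utf8_encode_code_point and sequence_length are hypothetical, and the encoder stops at U+10FFFF, the end of the Unicode code space, and rejects surrogate values.

```python
def utf8_encode_code_point(cp):
    """Encode one Unicode code point as UTF-8 bytes, following the bit patterns."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp <= 0x7F:
        return bytes([cp])                        # 0xxxxxxx
    if cp <= 0x7FF:
        return bytes([0xC0 | (cp >> 6),           # 110xxxxx
                      0x80 | (cp & 0x3F)])        # 10xxxxxx
    if cp <= 0xFFFF:
        return bytes([0xE0 | (cp >> 12),          # 1110xxxx
                      0x80 | ((cp >> 6) & 0x3F),  # 10xxxxxx
                      0x80 | (cp & 0x3F)])        # 10xxxxxx
    return bytes([0xF0 | (cp >> 18),              # 11110xxx
                  0x80 | ((cp >> 12) & 0x3F),     # 10xxxxxx
                  0x80 | ((cp >> 6) & 0x3F),      # 10xxxxxx
                  0x80 | (cp & 0x3F)])            # 10xxxxxx


def sequence_length(lead_byte):
    """Read the sequence length from the leading 1 bits of a lead byte."""
    if (lead_byte & 0x80) == 0:
        return 1  # single-byte (ASCII) form
    length = 0
    while lead_byte & (0x80 >> length):
        length += 1
    return length


# Compare the hand-written encoder against Python's own UTF-8 codec.
for cp in (0x41, 0xE9, 0x4E2D, 0x1F600):
    encoded = utf8_encode_code_point(cp)
    assert encoded == chr(cp).encode("utf-8")
    print(f"U+{cp:04X}", encoded.hex(" "), sequence_length(encoded[0]))
```

The loop covers one code point from each of the four ranges, so each row of the table is exercised once.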

In the past, the most commonly used encoding for Chinese characters was GB2312, released in 1980. Its full name is "Chinese Character Coded Character Set for Information Interchange - Basic Set". It represents each Chinese character with two bytes, covers 6,763 Chinese characters and 682 non-Chinese graphic characters, and is compatible with the ASCII character set. However, its repertoire is relatively small: it cannot represent the traditional Chinese characters used in Hong Kong and Taiwan, nor some rare characters and characters found in ancient books, which causes considerable inconvenience in practice. GB2312 was later extended into the GBK encoding standard, which can represent traditional Chinese characters and some variant characters and therefore came into wider use.
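The difference in coverage can be observed with Python's built-in "gb2312" and "gbk" codecs. The sketch below is only an illustration; the helper name can_encode and the two sample characters are choices made for this example.

```python
def can_encode(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


simplified = "汉"   # U+6C49, simplified form
traditional = "漢"  # U+6F22, traditional form

print(len(simplified.encode("gb2312")))   # 2
print("A".encode("gb2312"))               # b'A'
print(can_encode(traditional, "gb2312"))  # False
print(can_encode(traditional, "gbk"))     # True
```

The ASCII letter keeps its single-byte value, the simplified character takes two bytes, and the traditional character only becomes encodable once GBK is used.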

To serve a wider range of applications, the GB18030 encoding standard was released. GB18030-2000 includes 27,533 Chinese characters and GB18030-2005 includes 70,244, and the standard also covers the scripts of minority languages such as Tibetan, Mongolian, Dai, Yi, Korean and Uyghur. The total encoding space of GB18030 exceeds 1.5 million code positions. Characters are encoded with one, two or four bytes. The single-byte part follows the code structure and rules of GB/T 11383 and uses code positions 0x00 to 0x7F, which match the corresponding ASCII code positions. In the double-byte part, the first byte ranges from 0x81 to 0xFE, and the second byte ranges from 0x40 to 0x7E or from 0x80 to 0xFE. The four-byte part extends the double-byte encoding by using 0x30 to 0x39, which GB/T 11383 does not use, as the suffix: the second and fourth bytes fall in 0x30 to 0x39, and the first and third bytes fall in 0x81 to 0xFE. The four-byte codes therefore range from 0x81308130 to 0xFE39FE39. The GB18030 standard is still being extended.
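The byte ranges described above are enough to split a GB18030 byte stream into one-, two- and four-byte sequences. The following sketch assumes well-formed input and uses a hypothetical function name, gb18030_sequence_lengths; it checks the result against text produced by Python's built-in "gb18030" codec.

```python
def gb18030_sequence_lengths(data):
    """Split GB18030 bytes into sequences and return the length of each one."""
    lengths = []
    i = 0
    while i < len(data):
        first = data[i]
        if first <= 0x7F:
            size = 1  # single-byte part, same values as ASCII
        elif 0x81 <= first <= 0xFE and i + 1 < len(data):
            second = data[i + 1]
            if 0x30 <= second <= 0x39:
                size = 4  # a digit-range second byte marks the four-byte part
            elif 0x40 <= second <= 0xFE and second != 0x7F:
                size = 2  # double-byte part
            else:
                raise ValueError(f"invalid second byte {second:#04x} at offset {i + 1}")
        else:
            raise ValueError(f"invalid or truncated sequence at offset {i}")
        if i + size > len(data):
            raise ValueError(f"truncated sequence at offset {i}")
        lengths.append(size)
        i += size
    return lengths


# An ASCII letter, a common Chinese character, and a character outside the
# Basic Multilingual Plane (U+20000).
sample = "A" + "中" + chr(0x20000)
encoded = sample.encode("gb18030")
print(gb18030_sequence_lengths(encoded))  # [1, 2, 4]
```

Only the first two bytes are needed to decide the length of a sequence, because the second byte of a four-byte code falls in a range that double-byte codes never use.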

To represent more Chinese characters and special symbols, and for better compatibility in the future, it is best to use the GB18030 standard for newly created web pages, that is, to declare the encoding in one of the following two ways:

<meta http-equiv="Content-Type" content="text/html;charset=gb18030">
<meta charset="gb18030">
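The declaration only helps if the bytes of the page are actually saved in the declared encoding. The sketch below, with a made-up sample page and hypothetical variable names, encodes a page as GB18030 and shows what happens when the same bytes are read with a different encoding.

```python
# A tiny sample page whose declaration says gb18030 (made up for this example).
page = (
    "<!DOCTYPE html>\n"
    '<html><head><meta charset="gb18030"><title>测试</title></head>'
    "<body>汉字</body></html>"
)

# Save the page with the same encoding that the <meta> tag declares.
raw = page.encode("gb18030")

# Decoding with the declared encoding recovers the original text.
print(raw.decode("gb18030") == page)  # True

# Decoding with a mismatched encoding garbles the Chinese characters.
misread = raw.decode("utf-8", errors="replace")
print("汉字" in misread)  # False
```

The ASCII markup survives the mismatch, but the Chinese text does not, which is the typical symptom of a page whose declared and actual encodings differ.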

Of course, if the page also needs to display text in other languages, you can use the internationally accepted UTF-8 encoding instead.
