Introduction to HTML Chinese Character Encoding Standard

An HTML page should declare the character encoding it uses. The traditional way to declare it is:

<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">

In HTML5, you can also use a shorter form:

<meta charset="UTF-8">
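To check which encoding a page declares, both forms can be read back with a small parser. The following is a minimal sketch using Python's standard `html.parser` module; the names `CharsetFinder` and `declared_charset` are hypothetical helpers introduced here for illustration, and the sketch ignores HTTP headers and byte order marks, which browsers also take into account.

```python
from html.parser import HTMLParser


class CharsetFinder(HTMLParser):
    """Records the first charset declared by a <meta> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attr = dict(attrs)
        if attr.get("charset"):
            # Short HTML5 form: <meta charset="UTF-8">
            self.charset = attr["charset"].strip().lower()
        elif (attr.get("http-equiv") or "").lower() == "content-type":
            # Long form: content="text/html;charset=UTF-8"
            for part in (attr.get("content") or "").split(";"):
                name, _, value = part.strip().partition("=")
                if name.strip().lower() == "charset" and value:
                    self.charset = value.strip().lower()


def declared_charset(html_text):
    """Return the declared charset in lowercase, or None if there is none."""
    finder = CharsetFinder()
    finder.feed(html_text)
    return finder.charset
```

Both declaration forms shown above yield the same result, `utf-8`, when passed to `declared_charset`.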

Because the world uses many languages and scripts, a common standard is needed for converting and processing text across languages and platforms. The Unicode Consortium developed the Unicode standard for this purpose; it was first published in 1991 and has been extended continuously since. Unicode provides 1,114,112 code points and aims to cover the characters of all the world's writing systems, including historic scripts. However, a fixed-width encoding of that many code points, such as UTF-32, needs 32 bits (4 bytes) per character, which takes up a lot of storage. Even common characters such as those in ASCII need the full four bytes, so memory is used inefficiently.
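The storage cost described above is easy to measure. The following minimal sketch compares the size of the same text in UTF-32 and UTF-8 using Python's built-in codecs; `encoded_sizes` is a hypothetical helper name, and the `utf-32-be` codec is used so that no byte order mark is added to the count.

```python
# Unicode code points run from 0 to 0x10FFFF inclusive.
CODE_POINTS = 0x10FFFF + 1


def encoded_sizes(text):
    """Return the number of bytes the text occupies in each encoding."""
    return {
        "utf-32": len(text.encode("utf-32-be")),  # fixed 4 bytes per character
        "utf-8": len(text.encode("utf-8")),       # 1 to 4 bytes per character
    }
```

For the ASCII string `"Hello"`, this gives 20 bytes in UTF-32 against 5 bytes in UTF-8.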

To address this, UTF-8 was defined: a variable-width encoding format that uses 8-bit code units. In UTF-8, commonly used characters take fewer bytes and less common characters take more, which makes better use of storage space. ASCII characters are still represented by a single byte. The high bits of each byte mark its role in a sequence, which keeps UTF-8 compatible with ASCII while covering all of Unicode. The encoding rules are:
0000~007F: 0xxxxxxx, stored as one byte, with 7 bits for the code point; these are the ASCII characters
0080~07FF: 110xxxxx, 10xxxxxx, stored as two bytes, with 11 bits for the code point
0800~FFFF: 1110xxxx, 10xxxxxx, 10xxxxxx, stored as three bytes, with 16 bits for the code point
10000~10FFFF: 11110xxx, 10xxxxxx, 10xxxxxx, 10xxxxxx, stored as four bytes, with 21 bits for the code point (the pattern could hold values up to 1FFFFF, but Unicode ends at 10FFFF)

The pattern is as follows: if the highest bit of the first byte is not 0, the number of 1 bits before the first 0 gives the number of bytes in the sequence. Every byte after the first in a sequence has the prefix 10. Unicode also has other encoding formats, such as UTF-16 and UTF-32, but UTF-8 is the most widely used and can likewise represent every Unicode code point.
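The byte patterns above can be sketched directly with bit operations. This is an illustrative implementation, not a replacement for a real codec; `utf8_encode` and `sequence_length` are hypothetical names, and the sketch assumes Unicode's upper limit of 0x10FFFF and rejects the surrogate range D800~DFFF, which UTF-8 does not encode.

```python
def utf8_encode(code_point):
    """Encode one Unicode code point following the byte patterns in the list."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if code_point <= 0x7F:        # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:      # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


def sequence_length(lead_byte):
    """Read the sequence length from the high bits of the first byte."""
    if lead_byte & 0x80 == 0x00:
        return 1
    if lead_byte & 0xE0 == 0xC0:
        return 2
    if lead_byte & 0xF0 == 0xE0:
        return 3
    if lead_byte & 0xF8 == 0xF0:
        return 4
    raise ValueError("continuation byte or invalid lead byte")
```

For any single character `ch`, comparing `utf8_encode(ord(ch))` with `ch.encode("utf-8")` is a quick way to check the sketch against Python's built-in codec.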

In the past, the most common encoding for Chinese characters on computers was GB2312, released in 1980. Its full name is "Chinese Character Coded Character Set for Information Interchange - Basic Set". It represents each Chinese character with two bytes, covers 6,763 Chinese characters and 682 non-Chinese graphic characters, and is compatible with the ASCII character set. However, GB2312 contains relatively few Chinese characters. It cannot represent the traditional characters used in Hong Kong and Taiwan, nor many uncommon characters and characters found in ancient books, which causes considerable inconvenience in practice. GB2312 was later extended into the GBK encoding standard, which can represent traditional characters and some variant characters and is therefore usable in more situations.
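The difference in coverage between GB2312 and GBK can be observed with Python's built-in `gb2312` and `gbk` codecs. The sketch below is illustrative only; `encodable` is a hypothetical helper, and the sample characters (the traditional character 龍 and the uncommon character 镕) were chosen here as examples and do not come from the standards' documentation.

```python
def encodable(text, codec):
    """Return True if every character of text exists in the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# A common simplified character takes two bytes in GB2312,
# while ASCII characters keep their single-byte values.
common = "中".encode("gb2312")   # two bytes
ascii_a = "A".encode("gb2312")   # one byte, same as ASCII

# Traditional and uncommon characters are missing from GB2312 but present in GBK.
coverage = {
    ch: (encodable(ch, "gb2312"), encodable(ch, "gbk"))
    for ch in ("龍", "镕")
}
```

Each entry in `coverage` is a pair: whether the character can be encoded in GB2312, then whether it can be encoded in GBK.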

To cover a wider range of applications, the GB18030 encoding standard was released. GB18030-2000 includes 27,533 Chinese characters and GB18030-2005 includes 70,244; the standard also covers the scripts of minority languages such as Tibetan, Mongolian, Dai, Yi, Korean and Uyghur. The total encoding space of GB18030 exceeds 1.5 million code positions. Characters are encoded in one, two or four bytes. The single-byte part follows the encoding structure and rules of GB/T 11383 and uses code positions 0x00 to 0x7F, which correspond to the ASCII code positions. In the two-byte part, the first byte ranges from 0x81 to 0xFE, and the second byte ranges from 0x40 to 0x7E or from 0x80 to 0xFE. The four-byte part extends the two-byte encoding by using the values 0x30 to 0x39, which the two-byte structure of GB/T 11383 does not use, as the second and fourth bytes. The resulting four-byte encoding ranges from 0x81308130 to 0xFE39FE39. GB18030 is still being extended.
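Because the second byte tells two-byte and four-byte sequences apart, a byte stream can be split into characters using only the ranges given above. The following is a minimal sketch under that assumption; `split_gb18030` is a hypothetical helper that checks only the first two bytes of each sequence, and it is tested against Python's built-in `gb18030` codec.

```python
def split_gb18030(data):
    """Split GB18030 bytes into one-, two- and four-byte sequences."""
    units = []
    i = 0
    while i < len(data):
        first = data[i]
        if first <= 0x7F:
            size = 1                              # single byte, same as ASCII
        elif 0x81 <= first <= 0xFE and i + 1 < len(data):
            second = data[i + 1]
            if 0x30 <= second <= 0x39:
                size = 4                          # digit-range second byte marks four bytes
            elif 0x40 <= second <= 0xFE and second != 0x7F:
                size = 2                          # 0x40-0x7E or 0x80-0xFE marks two bytes
            else:
                raise ValueError("invalid second byte")
        else:
            raise ValueError("invalid or truncated sequence")
        if i + size > len(data):
            raise ValueError("truncated sequence")
        units.append(data[i:i + size])
        i += size
    return units


# One ASCII letter, one common Chinese character, and one character
# from outside the Basic Multilingual Plane (U+20000).
sample = "A中\U00020000".encode("gb18030")
units = split_gb18030(sample)
```

Here `units` holds three sequences of 1, 2 and 4 bytes, one for each character in the sample string.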

To represent more Chinese characters and special symbols, and for better compatibility in the future, it is best to use the GB18030 standard for newly created Chinese web pages, that is, to declare the encoding in one of the following two ways:

<meta http-equiv="Content-Type" content="text/html;charset=gb18030">
<meta charset="gb18030">

Of course, if the page also needs to display text in other languages, you can use the internationally adopted UTF-8 encoding instead.
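Whichever encoding is chosen, the declaration only labels the bytes; the file itself must actually be saved in that encoding. The following minimal sketch builds a page whose declared charset and stored bytes always agree. `build_page` is a hypothetical helper, and the page skeleton is an assumption made for illustration.

```python
def build_page(charset, body_text):
    """Return a small HTML page encoded in the same charset it declares."""
    html = (
        "<!DOCTYPE html>\n"
        "<html>\n<head>\n"
        f'<meta charset="{charset}">\n'
        "</head>\n"
        f"<body>{body_text}</body>\n"
        "</html>\n"
    )
    # Encode with the declared charset so the label matches the bytes.
    return html.encode(charset)


gb_page = build_page("gb18030", "中文")
utf8_page = build_page("utf-8", "中文")
```

The two pages differ in size as well as in their declaration: each of these Chinese characters takes two bytes in GB18030 and three bytes in UTF-8.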
