Introduction to HTML Chinese Character Encoding Standard

In HTML, you need to specify the encoding used by the web page. The general way to specify it is:

<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">

In HTML5, you can also use a shorter form:

<meta charset="UTF-8">
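Both declaration forms carry the same information. As a rough illustration, the following Python sketch reads the declared encoding from either form using the standard library's `html.parser`. The names `MetaCharsetParser` and `declared_charset` are made up for this example, and the logic is a simplification, not the detection algorithm browsers actually use.

```python
import re
from html.parser import HTMLParser
from typing import Optional


class MetaCharsetParser(HTMLParser):
    """Records the first charset declared in a <meta> tag (hypothetical helper)."""

    def __init__(self) -> None:
        super().__init__()
        self.charset: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)
        # Short HTML5 form: <meta charset="UTF-8">
        if attributes.get("charset"):
            self.charset = attributes["charset"].strip().lower()
            return
        # Older form: <meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
        if (attributes.get("http-equiv") or "").lower() == "content-type":
            content = attributes.get("content") or ""
            match = re.search(r"charset\s*=\s*([^\s;]+)", content, re.IGNORECASE)
            if match:
                self.charset = match.group(1).strip("\"'").lower()


def declared_charset(html_text: str) -> Optional[str]:
    """Return the declared charset in lower case, or None if there is none."""
    parser = MetaCharsetParser()
    parser.feed(html_text)
    return parser.charset


print(declared_charset('<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">'))
print(declared_charset('<meta charset="UTF-8">'))
```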

Many languages and scripts are in use around the world. To support text conversion and processing across languages and platforms, the Unicode Consortium developed the Unicode standard, whose first version was published in 1991 and which has been revised continuously since. Its code space holds 1,114,112 code points, and it aims to cover the characters of every writing system, including historic scripts. Storing each code point directly in a fixed-width form (UTF-32) takes 32 bits (i.e. 4 bytes) per character, which uses a relatively large amount of storage. Even common characters such as those in ASCII need the full 4 bytes, so memory is used inefficiently.
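The size of the code space and the cost of the fixed 32-bit form can be checked with a few lines of Python. This is only a sketch: the helper name `utf32_size` is invented here, and the three sample characters are arbitrary choices.

```python
def utf32_size(ch: str) -> int:
    """Bytes needed for one character in UTF-32, without a byte order mark."""
    return len(ch.encode("utf-32-le"))


# An ASCII letter, a Chinese character, and an emoji (U+1F600)
for ch in ["A", "中", "\U0001F600"]:
    print(f"U+{ord(ch):04X} takes {utf32_size(ch)} bytes in UTF-32")

# The Unicode code space: 17 planes of 65,536 code points each
print(17 * 0x10000)
```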

To address this, UTF-8 was defined: a variable-width encoding format that uses 8-bit code units. In UTF-8, commonly used characters are represented with fewer bytes and less common characters with more, which makes storage more efficient. ASCII characters, for example, are still represented by a single byte. This is achieved by using the high bits of each byte as markers, and it keeps UTF-8 compatible with ASCII. The encoding rules (code point ranges in hexadecimal) are:
0000~007F: 0xxxxxxx, stored as one byte, with 7 bits for the code point; these are the ASCII characters
0080~07FF: 110xxxxx, 10xxxxxx, stored as two bytes, with 11 bits for the code point
0800~FFFF: 1110xxxx, 10xxxxxx, 10xxxxxx, stored as three bytes, with 16 bits for the code point
10000~10FFFF: 11110xxx, 10xxxxxx, 10xxxxxx, 10xxxxxx, stored as four bytes, with 21 bits for the code point (the bit pattern could reach 1FFFFF, but the Unicode code space ends at 10FFFF)

The pattern is this: if the highest bit of the first byte is not 0, the number of 1 bits before the first 0 gives the number of bytes in the sequence. Every byte after the first has the prefix 10. Unicode also has other encoding forms, such as UTF-16 and UTF-32, all of which can represent every Unicode code point, but UTF-8 is the most widely used.
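The bit patterns listed above can be turned into a small hand-written encoder. The Python sketch below is for illustration only, since real code would simply call `str.encode("utf-8")`. The function name `utf8_encode` is invented for this example, and it assumes the usual Unicode limits: code points stop at 10FFFF and the surrogate range D800~DFFF is rejected.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 by filling in the bit patterns."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {code_point:#x}")
    if code_point <= 0x7F:
        # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# Compare the hand-written encoder with Python's built-in codec
for ch in ["A", "é", "中", "\U0001F600"]:
    encoded = utf8_encode(ord(ch))
    print(f"U+{ord(ch):04X}", encoded.hex(" "), encoded == ch.encode("utf-8"))
```

For the character 中 (U+4E2D), for instance, the three-byte pattern yields the bytes E4 B8 AD.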

In the past, the most commonly used encoding for Chinese characters on computers was GB2312, released in 1980. Its full name is "Chinese Character Coded Character Set for Information Interchange - Basic Set". It uses two bytes per Chinese character and contains 6,763 Chinese characters and 682 non-Chinese graphic characters, while remaining compatible with the ASCII character set. However, it covers relatively few Chinese characters: it cannot represent the traditional characters used in Hong Kong and Taiwan, nor some rare characters and characters found in ancient books, which is inconvenient in practice. GB2312 was later extended into the GBK encoding, which adds traditional characters and some variant characters and therefore has a wider range of use.
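The difference in coverage can be observed with Python's built-in `gb2312` and `gbk` codecs. This is a small sketch: the helper `can_encode` is invented for the example, and the character pair 国 / 國 is just one convenient simplified/traditional pair.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


simplified = "国"   # simplified form, included in GB2312
traditional = "國"  # traditional form, not included in GB2312

for encoding in ("gb2312", "gbk"):
    print(encoding, can_encode(simplified, encoding), can_encode(traditional, encoding))

# A Chinese character takes two bytes; ASCII stays a single, unchanged byte
print(len(simplified.encode("gb2312")), "A".encode("gb2312") == b"A")
```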

To suit a wider range of applications, the GB18030 encoding standard was released. GB18030-2000 includes 27,533 Chinese characters and GB18030-2005 includes 70,244, along with the scripts of minority languages such as Tibetan, Mongolian, Dai, Yi, Korean and Uyghur. The total encoding space of GB18030 exceeds 1.5 million code positions. Characters are encoded with one, two or four bytes. The single-byte part follows the encoding structure and rules of GB/T 11383 and uses code positions 0x00 to 0x7F, which correspond to the ASCII code positions. In the double-byte part, the first byte ranges from 0x81 to 0xFE and the second byte from 0x40 to 0x7E or from 0x80 to 0xFE. The four-byte part uses the byte values 0x30 to 0x39, which GB/T 11383 does not use, as a suffix that extends the double-byte encoding, so the second and fourth bytes fall in 0x30 to 0x39. The resulting four-byte codes range from 0x81308130 to 0xFE39FE39. GB18030 continues to be revised.
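The byte ranges described above are enough to tell how long a GB18030 sequence is from its first two bytes. The following Python sketch shows that rule and counts the code positions the ranges allow. The function name `gb18030_sequence_length` is invented here, and the check is purely structural: it does not verify that a position is actually assigned to a character.

```python
def gb18030_sequence_length(data: bytes) -> int:
    """Length in bytes (1, 2 or 4) of the GB18030 sequence starting at data[0]."""
    if not data:
        raise ValueError("empty input")
    first = data[0]
    if first <= 0x7F:
        return 1  # single-byte part, same positions as ASCII
    if not 0x81 <= first <= 0xFE or len(data) < 2:
        raise ValueError("invalid or truncated lead byte")
    second = data[1]
    if 0x30 <= second <= 0x39:
        return 4  # four-byte part
    if 0x40 <= second <= 0x7E or 0x80 <= second <= 0xFE:
        return 2  # double-byte part
    raise ValueError("invalid second byte")


# Check the rule against Python's built-in gb18030 codec
for ch in ["A", "中", "\U0001F600"]:
    encoded = ch.encode("gb18030")
    print(f"U+{ord(ch):04X}", encoded.hex(" "), gb18030_sequence_length(encoded))

# Code positions allowed by the byte ranges
single = 0x80                 # 0x00 to 0x7F
double = 126 * (63 + 127)     # lead byte * (0x40-0x7E plus 0x80-0xFE)
four = 126 * 10 * 126 * 10    # 0x81-0xFE, 0x30-0x39, 0x81-0xFE, 0x30-0x39
print(single + double + four)
```

The total comes to 1,611,668 positions, which is consistent with the figure of more than 1.5 million.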

To represent more Chinese characters and special symbols, and for better compatibility in the future, it is best to use the GB18030 standard for newly created web pages. Specify the encoding in one of the following two ways:

<meta http-equiv="Content-Type" content="text/html;charset=gb18030">
<meta charset="gb18030">
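The declaration only tells the browser how to decode the page, so the file itself must actually be saved in the same encoding. The Python sketch below writes a small page as GB18030 bytes and reads it back. The page content, the file name `index.html` and the helper `write_page` are all made up for this example.

```python
import tempfile
from pathlib import Path

# A hypothetical minimal page whose declaration matches the bytes on disk
PAGE = """<!DOCTYPE html>
<html>
<head>
<meta charset="gb18030">
<title>示例</title>
</head>
<body><p>中文网页</p></body>
</html>
"""


def write_page(path: Path, encoding: str = "gb18030") -> None:
    """Save PAGE using the same encoding that its <meta> tag declares."""
    # write_bytes avoids platform-specific newline translation
    path.write_bytes(PAGE.encode(encoding))


with tempfile.TemporaryDirectory() as tmp:
    page_path = Path(tmp) / "index.html"
    write_page(page_path)
    raw = page_path.read_bytes()
    # Decoding with the declared encoding restores the original text
    print(raw.decode("gb18030") == PAGE)
```

If the file were saved as UTF-8 while still declaring gb18030, the Chinese text would show up garbled in the browser.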

Of course, to make it easier to display text in other languages, you can also use the internationally accepted UTF-8 encoding, which is also what the current HTML specification recommends.
