Should I use UTF-8 or GB2312 encoding when building a website?

When we open foreign websites, we often see garbled characters, and many non-English foreign websites display nothing but 口口口口口 boxes.

WordPress uses UTF-8, while many CMSs use GB2312.

● Why are there so many codes?

● What is the difference between UTF-8 and GB2312?

● When we build websites in China, is it better to use UTF-8 or GB2312?

1. The origin of various codes

Many readers may have long been confused by the various character encodings and have no idea why there are so many of them.

ANSI encoding

A long time ago, a group of people decided to combine 8 transistors, each of which could be switched on or off, into different states to represent everything in the world. They saw that 8 switches worked well, so they called this group a "byte."

Initially, computers were only used in the United States. An eight-bit byte can be combined into a total of 256 (2 to the power of 8) different states.

They assigned special uses to the 32 states numbered starting from 0. Once the terminal or printer encounters these agreed bytes being transmitted, it will perform some agreed actions.

When it encounters 0x0A, the terminal starts a new line. When it encounters 0x07, the terminal beeps. When it encounters 0x1B, the printer prints highlighted text, or the terminal displays letters in color. They thought this was a good idea, so they called the byte states below 0x20 "control codes".
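As a quick check of these byte values, the short Python sketch below prints the numeric codes of three control characters. Python is used here only for illustration and is not part of the original discussion; the dictionary of descriptions is an assumption made for the example.

```python
# A few ASCII control codes and what terminals traditionally do with them.
control_codes = {
    "\n": "line feed (start a new line)",
    "\a": "bell (beep)",
    "\x1b": "escape (starts terminal or printer commands)",
}

for char, meaning in control_codes.items():
    print(f"0x{ord(char):02X}  {meaning}")

# Every control code sits below 0x20, the code of the space character.
all_below_space = all(ord(char) < 0x20 for char in control_codes)
print(all_below_space)
```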

They then assigned all the spaces, punctuation marks, digits, and uppercase and lowercase letters to consecutive byte states, up to number 127, so that computers could use different bytes to store English text. Everyone felt good about this, so they called the scheme ANSI's "ASCII" encoding (American Standard Code for Information Interchange). At that time, all computers in the world used the same ASCII scheme to store English text.
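A minimal sketch of what this means in practice, assuming Python's built-in "ascii" codec as a stand-in for the scheme described above. The sample strings are made up for illustration.

```python
text = "Hello, World! 123"
data = text.encode("ascii")          # one byte per character

# Every byte value stays within 0..127.
fits_in_seven_bits = all(b <= 127 for b in data)
print(list(data))
print(fits_in_seven_bits)

# A letter outside ASCII cannot be stored with this scheme.
try:
    "café".encode("ascii")
    ascii_accepts_accent = True
except UnicodeEncodeError:
    ascii_accepts_accent = False
print(ascii_accepts_accent)
```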

Extended ANSI encoding

Later, just like the construction of the Tower of Babel, people all over the world began to use computers. Many countries did not use English, and many of their letters were not in ASCII. To store their text on computers, they decided to use the empty space after number 127 for these new letters and symbols. They also added many shapes needed for drawing tables, such as horizontal lines, vertical lines, and crosses, and the numbering ran all the way up to the last state, 255. The character set from 128 to 255 is called the "extended character set". From then on, greedy humans had no new states left to use, and the United States probably had not expected that other countries would need computers too.

GB2312 encoding

When people in China got computers, there were no byte states left to represent Chinese characters, and more than 6,000 commonly used Chinese characters needed to be stored. So the Chinese simply discarded the strange symbols after number 127 without hesitation.

The rule is: a byte less than 127 keeps its original meaning, but two consecutive bytes greater than 127 together represent one Chinese character. The first byte (called the high byte) ranges from 0xA1 to 0xF7, and the second byte (the low byte) ranges from 0xA1 to 0xFE. In this way, about 7,000 simplified Chinese characters can be encoded.
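The rule can be checked with Python's built-in "gb2312" codec. This is only a sketch: the sample text is an arbitrary choice, and the range check simply restates the byte ranges given above.

```python
text = "中文"                          # sample text chosen for illustration
data = text.encode("gb2312")
print(data.hex())

# Split the bytes into (high byte, low byte) pairs, one pair per character.
pairs = [(data[i], data[i + 1]) for i in range(0, len(data), 2)]
in_range = all(0xA1 <= hi <= 0xF7 and 0xA1 <= lo <= 0xFE for hi, lo in pairs)
print(in_range)

# Bytes below 128 keep their ASCII meaning.
ascii_unchanged = "abc".encode("gb2312") == b"abc"
print(ascii_unchanged)
```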

These codes also include mathematical symbols, Roman and Greek letters, and Japanese kana. Even the digits, punctuation marks, and letters that already existed in ASCII were all re-encoded as two-byte characters. These are commonly known as "full-width" characters, while the original ones below 127 are called "half-width" characters. This Chinese character scheme is called "GB2312". GB2312 is a Chinese extension of ASCII.
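The difference between half-width and full-width characters shows up directly in the byte counts. The sketch below assumes Python's "gb2312" codec; the characters are picked only as examples.

```python
half = "A"          # ordinary half-width ASCII letter
full = "\uff21"     # FULLWIDTH LATIN CAPITAL LETTER A, displayed as "Ａ"
kana = "あ"          # Japanese hiragana, also part of GB2312

half_bytes = half.encode("gb2312")
full_bytes = full.encode("gb2312")
kana_bytes = kana.encode("gb2312")

# The half-width letter takes one byte; the other two take two bytes each.
print(len(half_bytes), len(full_bytes), len(kana_bytes))
```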

GBK and GB18030 encoding

However, China has far too many Chinese characters, and we soon discovered that many people's names could not be typed, including the names of some national leaders, which was very troublesome. So we had to keep digging out the code positions that GB2312 left unused and put them to use.

Later, even that was not enough, so the requirement that the low byte must also be greater than 127 was dropped. As long as the first byte is greater than 127, it always marks the beginning of a Chinese character, regardless of whether the byte that follows belongs to the extended character set. The resulting scheme is called the GBK standard. It includes everything in GB2312 and brings the total to more than 20,000 Chinese characters (including traditional Chinese characters) and symbols.
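A small sketch of the relaxed low-byte rule, using Python's "gb2312" and "gbk" codecs. The character 丂 (U+4E02) is chosen here only because it is one that GBK has and GB2312 lacks; it does not come from the original text.

```python
char = "丂"   # U+4E02, present in GBK but absent from GB2312

try:
    char.encode("gb2312")
    in_gb2312 = True
except UnicodeEncodeError:
    in_gb2312 = False

gbk_bytes = char.encode("gbk")
high, low = gbk_bytes

# The high byte is above 127, but the low byte is allowed to be below 128.
low_is_below_128 = low < 0x80
print(in_gb2312, hex(high), hex(low), low_is_below_128)
```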

Later, ethnic minorities also needed to use computers, so the scheme was expanded again with several thousand new characters for minority scripts, and GBK grew into GB18030. From then on, the culture of the Chinese nation could be passed on in the computer age.

The biggest feature of this family of standards is that two-byte Chinese characters and one-byte English characters coexist in the same encoding. Therefore, to support Chinese text, a program must check the value of every byte in a string: if the value is greater than 127, a double-byte character has begun.
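The byte-by-byte check described above can be sketched as follows. `count_characters` is a hypothetical helper written for this example; it assumes GB2312 or GBK input and does not handle the four-byte sequences that GB18030 adds.

```python
def count_characters(data: bytes) -> int:
    """Count characters in GB2312/GBK bytes by inspecting each byte value."""
    count = 0
    i = 0
    while i < len(data):
        # A byte above 127 starts a two-byte character; otherwise it is ASCII.
        i += 2 if data[i] > 127 else 1
        count += 1
    return count

sample = "GB2312编码abc".encode("gbk")
print(len(sample), count_characters(sample))
```

The byte length and the character count differ, which is exactly why programmers of that era had to keep "one Chinese character counts as two English characters" in mind.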

At that time, all programmers who had received programming training had to recite the following mantra hundreds of times every day:

"One Chinese character counts as two English characters! One Chinese character counts as two English characters..."

UNICODE encoding

At that time, every country, like China, created its own encoding standard. As a result, nobody understood anyone else's encoding, and nobody supported it. Even the mainland and Taiwan, only 150 nautical miles apart and using the same language, adopted different encoding schemes.

At that time, anyone in China who wanted a computer to display Chinese characters had to install a "Chinese character system", which handled the display and input of Chinese characters.

However, a program written in Taiwan could only be used after installing a different system, the "Yitian Chinese Character System", which supports BIG5 encoding. If the wrong character system was installed, the display was a mess! What could be done about this? Moreover, some poorer peoples of the world could not yet use computers at all. What should be done about their writing?
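The mess caused by reading text with the wrong table is easy to reproduce. The sketch below uses Python's "gb2312" and "big5" codecs; the sample text is an assumption, and `errors="replace"` is used so that byte pairs BIG5 cannot map become U+FFFD instead of raising an error.

```python
original = "简体中文"                      # sample text, chosen for illustration
gb_bytes = original.encode("gb2312")      # bytes as saved on a GB2312 system

# A system expecting BIG5 reads the same bytes with the wrong table.
misread = gb_bytes.decode("big5", errors="replace")
print(misread)

# Reading the bytes with the table they were written in restores the text.
correct = gb_bytes.decode("gb2312")
print(correct)
```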

At this moment, an angel appeared in time: an international organization called ISO (International Organization for Standardization) decided to tackle this problem. Their approach was very simple: abolish all regional encoding schemes and create a new one that includes all cultures, letters and symbols on earth! They named it the "Universal Multiple-Octet Coded Character Set", or UCS for short, commonly known as UNICODE.

With UNICODE, the era when one Chinese character counted as two English characters is almost over.

Whether it is a half-width English letter or a full-width Chinese character, each is now simply "one character"! At the same time, in the original 16-bit form of UNICODE (UCS-2), each occupies "two bytes".
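A minimal sketch of this idea. Python has no codec named UCS-2, so "utf-16-be" is used as a stand-in here; for characters like these two it produces the same 16-bit units. The sample string is an assumption.

```python
text = "A中"

# Both are counted as one character each.
print(len(text))

# In the 16-bit form, each character takes two bytes.
ucs2 = text.encode("utf-16-be")    # big-endian, no byte order mark
bytes_per_char = len(ucs2) // len(text)
print(len(ucs2), bytes_per_char)

# For comparison, GBK stores the same text as 1 + 2 bytes.
gbk = text.encode("gbk")
print(len(gbk))
```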

UTF-8 and UTF-16

UNICODE arrived together with the rise of computer networks. How to transmit UNICODE over the network also had to be considered, so a number of UTF (UCS Transformation Format) standards appeared. As the name suggests, UTF-8 handles data in 8-bit units, and UTF-16 handles data in 16-bit units. However, for reliable transmission, UNICODE values do not map directly onto UTF bytes; they are converted according to specific algorithms and rules.
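To give a feel for the conversion rules, here is a sketch of the UTF-8 bit layout written by hand and compared with Python's built-in encoder. `utf8_encode` is a hypothetical helper for illustration; it skips validation such as rejecting surrogate code points.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point following the UTF-8 bit layout."""
    if code_point < 0x80:          # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:         # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:       # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

# Print the hand-made result next to the built-in one for a few samples.
for ch in "A", "é", "中", "😀":
    print(ch, utf8_encode(ord(ch)).hex(), ch.encode("utf-8").hex())
```

An English letter stays one byte, while a Chinese character becomes three bytes, which matters for the file size discussion later on.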

The Future of UCS-4

As mentioned earlier, UNICODE uses two bytes to represent one character, which gives a total of 65,536 different combinations, probably enough to cover the symbols of all cultures in the world. If that is still not enough, it doesn't matter: ISO has prepared the UCS-4 scheme. Put simply, four bytes are used to represent one character, which gives about 2.1 billion different combinations (the highest bit has other uses). This can probably last until the day the Galactic Federation is established!
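The arithmetic behind these two figures, as a short sketch. The use of Python's "utf-32-be" codec to show a four-byte character is an assumption made for illustration.

```python
two_byte_states = 2 ** 16      # every combination of 16 bits
four_byte_states = 2 ** 31     # 32 bits with the highest bit reserved
print(two_byte_states)
print(four_byte_states)

# A four-bytes-per-character encoding of one Chinese character.
four_bytes = "中".encode("utf-32-be")
print(len(four_bytes))
```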

2. Why do some websites sometimes display garbled characters when opened?

Garbled web pages arise when a browser (such as IE) interprets an HTML page with the wrong encoding.

If the wrong encoding is declared in the page's code (relatively rare), the page will look like this:

<HTML>

<HEAD>

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1"></HEAD>……

</HTML>

When the browser displays this page, garbled characters appear, because the browser interprets the page's encoding as "Western European".

The solution is to change the charset from "ISO-8859-1" to GB2312, or to BIG5 if it is a traditional Chinese web page.
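What the browser does with a wrongly declared charset can be imitated in a few lines. This is only a sketch: the page content is an invented sample, and Python's codecs stand in for the browser's decoder.

```python
page_text = "中文网页"                      # sample page content (an assumption)
raw_bytes = page_text.encode("gb2312")     # the bytes the server actually sends

# Wrong charset declared: every byte is read as one Western European letter.
garbled = raw_bytes.decode("iso-8859-1")
print(garbled)

# Correct charset declared: the same bytes decode to the intended text.
fixed = raw_bytes.decode("gb2312")
print(fixed)
```

The bytes on the wire are identical in both cases; only the declared charset changes what the reader sees.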

Another possibility is that the web page does not declare its encoding at all, that is, it is missing this line:

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=XXXXX">

and your computer's default encoding is different from the one the page was written in. For example, this problem often occurs when we visit some Japanese websites. The programmers developed the website for local people, and since the local encoding is their default, they see no garbled characters. But you are an outsider, and your operating system does not default to that encoding, so you have to switch the encoding manually.
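The fallback behaviour can be sketched roughly like this. `pick_charset` is a hypothetical helper, not how any real browser is implemented (real browsers use more elaborate detection); the Shift_JIS tag and the "gb2312" default are sample values.

```python
import re

def pick_charset(html_head: bytes, system_default: str) -> str:
    """Return the charset declared in a META tag, else the system default."""
    match = re.search(rb"charset\s*=\s*[\"']?([A-Za-z0-9_-]+)",
                      html_head, re.IGNORECASE)
    if match:
        return match.group(1).decode("ascii").lower()
    # Nothing declared: fall back to whatever this computer uses by default.
    return system_default

declared = b'<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=Shift_JIS">'
undeclared = b"<HEAD><TITLE>page</TITLE></HEAD>"

print(pick_charset(declared, "gb2312"))
print(pick_charset(undeclared, "gb2312"))
```

With no declaration, a Japanese page viewed on a machine whose default is GB2312 ends up decoded as GB2312, which is where the garbled characters come from.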

As for the situation where 口口口口口 boxes appear:

This is because the website does not use UTF-8 but a local encoding, such as a Mongolian or Arabic encoding. Your computer does not support this encoding, so it cannot display the characters.

The solution is to install multi-language support for the browser in advance (for example, the multi-language support package offered when installing IE). Then, when garbled characters appear on a web page, open "View" / "Encoding" in the browser's menu bar and select the entry that matches the page's language, for example Mongolian for a Mongolian page or Arabic for an Arabic page, and so on for other languages. This eliminates the garbled characters on the page.

3. What coding is better for developing websites at present?

Our general understanding is:

UTF-8 is a universal encoding that fully supports Chinese. If we want our website to display correctly for foreign users, it is best to use UTF-8.

GB2312 is a Chinese encoding aimed mainly at domestic users. If foreign users visit a website encoded in GB2312, they may see garbled characters.
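A small sketch of the coverage difference between the two encodings. `can_encode` is a hypothetical helper, and the sample greetings are assumptions chosen for illustration.

```python
samples = {
    "Chinese": "你好",
    "Arabic": "مرحبا",
    "Korean": "안녕하세요",
}

def can_encode(text: str, encoding: str) -> bool:
    """Report whether every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

for language, text in samples.items():
    print(language, can_encode(text, "utf-8"), can_encode(text, "gb2312"))
```

UTF-8 can store all three samples, while GB2312 can store only the Chinese one.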

Netizens generally believe that UTF-8 is used much more than GB2312, and everyone is in favor of using UTF-8.

From a survey of foreign websites, we can also see that:

From this figure, we can see that during the period 2001-2008, the use of GB2312 encoding was not large, but it was still steadily increasing; the blue line shows that more and more websites are using UTF-8.

I selected several large domestic portals to see what encoding format they use:

Maybe some students will ask why several domestic websites use GB2312 more often.

I have thought about this question too, and I believe there are 3 reasons:

1. These domestic websites have a long history and used GB2312 from the start. Converting the existing pages to UTF-8 now would be too difficult and risky.

2. For Chinese text, UTF-8 files take up more space than GB2312 files, because a Chinese character usually takes three bytes in UTF-8 but two bytes in GB2312. Although this can be ignored on current hardware, these portal websites generate static versions of nearly all their pages to reduce server load, and the files saved in UTF-8 are larger. A portal-level website generates a very large number of files every day, so the storage cost rises accordingly.

3. Because UTF-8 sends more data over the network than GB2312 for the same Chinese content, it is less suitable for portal-level websites. It quietly increases bandwidth usage, and using GB2312 is undoubtedly the better optimization for network traffic.

Therefore, when building a new website, it is recommended to choose UTF-8. None of the reasons above apply to a new site, so compatibility should come first.
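The size difference behind reasons 2 and 3 can be measured directly. The sketch below uses made-up sample data, not real portal pages.

```python
chinese = "网站编码" * 1000      # 4,000 Chinese characters (sample data)
english = "website " * 1000     # 8,000 ASCII characters (sample data)

for label, text in (("Chinese", chinese), ("English", english)):
    utf8_size = len(text.encode("utf-8"))
    gb_size = len(text.encode("gb2312"))
    print(label, utf8_size, gb_size)
```

Pure Chinese text is about half as large again in UTF-8, while ASCII content such as HTML tags is the same size in both, so the overhead on a real page depends on how much of it is Chinese text.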
