Unicode signature BOM (Byte Order Mark) issue for UTF-8 files

I recently ran into something strange while debugging a Chinese Zen Cart website encoded in UTF-8. The text on the web page displayed normally, but when I used IE's View Source (which opens the page in Notepad), the source showed garbled characters. Firefox did not have this problem. After a lot of online research and testing, I solved it. The cause turned out to be the Unicode signature BOM (Byte Order Mark) of the UTF-8 file.

The BOM (Byte Order Mark) is the code point U+FEFF placed at the start of a text stream to identify its encoding. In UTF-16 it is stored as FF FE (little-endian) or FE FF (big-endian), and in UTF-8 it becomes EF BB BF. The mark is optional. Since UTF-8 has no byte-order ambiguity, the BOM there serves only as a signature for detecting whether a byte stream is UTF-8 encoded. Microsoft software performs this detection, but some software does not and treats the three bytes as ordinary characters.
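To make the byte sequences concrete, here is a minimal Python sketch that encodes U+FEFF in several UTF encodings and prints the resulting bytes. The helper name `bom_bytes` is my own and is only for illustration.

```python
# U+FEFF is the single code point behind every BOM; only its byte encoding differs.
BOM_CHAR = "\ufeff"


def bom_bytes(encoding: str) -> str:
    """Return the BOM encoded in `encoding` as an uppercase hex string."""
    return BOM_CHAR.encode(encoding).hex(" ").upper()


# Explicit-endian codec names are used so Python does not add a BOM of its own.
for name in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"):
    print(f"{name:10s} {bom_bytes(name)}")
```

The first two lines of output are `EF BB BF` for UTF-8 and `FF FE` for little-endian UTF-16, matching the values above.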

Microsoft tools prepend the three bytes EF BB BF to the UTF-8 text files they save. Programs such as Notepad on Windows use these three bytes to decide whether a text file is ANSI (the legacy code page) or UTF-8. The Unicode Standard permits this signature but does not require it, and UTF-8 text files on other platforms usually do not carry it.

In other words, a UTF-8 file may or may not have a BOM. How can you tell which? There are three methods:
1. Open the file with UltraEdit-32, switch to hexadecimal editing mode, and check whether the file starts with EF BB BF.
2. Open it with Dreamweaver, open the page properties, and see whether "Include Unicode Signature BOM" is checked.
3. Open it with Windows Notepad, choose "Save As", and check whether the default encoding offered is UTF-8 or ANSI. If it is ANSI, the file has no BOM.
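The first method (looking at the first three bytes) can also be done in a script. The following is a minimal sketch, not a definitive tool; `has_utf8_bom` and `write_sample` are hypothetical helper names, and the sample PHP content is made up for the demo.

```python
import os
import tempfile

UTF8_BOM = b"\xef\xbb\xbf"  # the three signature bytes discussed above


def has_utf8_bom(path: str) -> bool:
    """Return True if the file at `path` starts with EF BB BF."""
    with open(path, "rb") as f:  # binary mode, so no decoding hides the BOM
        return f.read(len(UTF8_BOM)) == UTF8_BOM


def write_sample(data: bytes) -> str:
    """Write `data` to a temporary file and return its path (demo helper)."""
    fd, path = tempfile.mkstemp(suffix=".php")
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return path


# Demo: one file with the signature, one without.
with_bom = write_sample(UTF8_BOM + b"<?php echo 'hi';")
without_bom = write_sample(b"<?php echo 'hi';")
print(has_utf8_bom(with_bom), has_utf8_bom(without_bom))
os.remove(with_bom)
os.remove(without_bom)
```

Reading in binary mode matters here: a text-mode read with a decoding codec may silently consume or reinterpret the signature.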

I located html_header.php among the Zen Cart template files and found that it had no BOM. I re-saved it with UltraEdit-32 with the BOM added, uploaded it again, and everything was normal.
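I did this in UltraEdit-32, but the same change can be sketched in Python for anyone who prefers a script. The function name `add_utf8_bom` and the sample page content are assumptions made for the example.

```python
import codecs


def add_utf8_bom(data: bytes) -> bytes:
    """Prepend EF BB BF unless the data already starts with it."""
    if data.startswith(codecs.BOM_UTF8):
        return data
    return codecs.BOM_UTF8 + data


# Sample UTF-8 page content containing two Chinese characters.
page = "<html>\u4e2d\u6587</html>".encode("utf-8")
signed = add_utf8_bom(page)

print(signed[:3].hex(" ").upper())   # the first three bytes after signing
print(add_utf8_bom(signed) == signed)  # applying it twice adds nothing

# Python's "utf-8-sig" codec writes the same signature when encoding text.
print("abc".encode("utf-8-sig") == add_utf8_bom(b"abc"))
```

The check before prepending keeps the operation safe to run more than once on the same file contents.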

Note that when Convertz converts a GB2312 file to UTF-8, its default setting is to not include a BOM, and the garbled characters described above may appear without one. However, if a BOM is included, be careful with PHP include files: EF BB BF ends up in front of the PHP code, so PHP sends those three bytes to the browser as output before the script runs. This early output can cause program errors, for example "headers already sent" failures when the script later calls header() or session_start(). One solution is to save all included files as ANSI and keep only the main file as UTF-8.

To remove the BOM from a file, open it with UltraEdit, switch to hexadecimal editing mode, replace each of the first three bytes (EF BB BF) with 20, and save the file (turn off the automatic backup function when saving). Then switch back to the default editing mode and delete the three leading spaces.
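As an alternative to the hex-editor procedure, the BOM can be stripped with a few lines of Python. This is a minimal sketch under my own naming (`strip_utf8_bom`), and the include-file content is invented for the demo.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return `data` without a leading EF BB BF, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# A made-up PHP include file that was saved with a BOM.
include_file = codecs.BOM_UTF8 + b"<?php $x = 1; ?>"

cleaned = strip_utf8_bom(include_file)
print(cleaned[:5])  # now starts directly with the PHP opening tag

# Decoding shows why the BOM matters: plain "utf-8" keeps it as a character,
# while "utf-8-sig" drops it.
print(repr(include_file.decode("utf-8")[0]))
print(repr(include_file.decode("utf-8-sig")[0]))
```

To apply it to a real file, read the file in binary mode, pass the bytes through `strip_utf8_bom`, and write the result back in binary mode.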

I also picked up a few facts about encodings. Files that Windows saves as "Unicode" are actually UTF-16, whose code units happen to equal the Unicode code point for most characters (those in the Basic Multilingual Plane). Conceptually, though, Unicode and UTF are two different things: Unicode assigns a number (code point) to each character, and a UTF is a scheme for storing and transmitting those numbers as bytes. UTF-16 comes in two byte orders: low byte first (LE, little-endian) and high byte first (BE, big-endian). The official UTF encodings also include UTF-32, which likewise has LE and BE forms. UTF-7, which is not part of the Unicode Standard, was designed mainly for email transmission. The single-byte part of UTF-8 is compatible with ASCII (the lower half of ISO-8859-1). UTF-8 exists largely because some old systems and library functions cannot handle UTF-16 correctly. For English text it also saves file space, at the expense of extra space for many non-English characters. ASCII characters take one byte in both UTF-8 and ISO-8859-1; all other characters take two to four bytes in UTF-8.
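These size and byte-order differences are easy to check. The sketch below counts the bytes each encoding needs for four sample characters of my own choosing; `encoded_lengths` is a hypothetical helper name.

```python
def encoded_lengths(ch: str) -> dict:
    """Map each encoding name to the number of bytes it needs for `ch`."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(ch.encode(enc)) for enc in encodings}


samples = {
    "A": "ASCII letter",
    "\u00e9": "Latin-1 letter e with acute accent",
    "\u4e2d": "Chinese character (zhong)",
    "\U0001F600": "emoji outside the Basic Multilingual Plane",
}

for ch, label in samples.items():
    print(f"U+{ord(ch):04X} {label}: {encoded_lengths(ch)}")

# Byte order: the same UTF-16 code unit 0x4E2D, stored low byte first (LE)
# and high byte first (BE).
print("\u4e2d".encode("utf-16-le").hex(" ").upper())
print("\u4e2d".encode("utf-16-be").hex(" ").upper())
```

The ASCII letter takes one byte in UTF-8, the accented Latin-1 letter two, the Chinese character three, and the emoji four, while UTF-16 uses two bytes for the first three and four for the emoji.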
