Unicode signature BOM (Byte Order Mark) issue for UTF-8 files

I recently ran into something odd while debugging a Chinese Zen Cart website encoded in UTF-8. The page text displayed normally, but when I used IE's View Source (which opens the source in Notepad), the text was garbled. Firefox did not have this problem. After a lot of searching and testing, I solved it: the cause turned out to be the Unicode signature, or BOM (Byte Order Mark), of the UTF-8 file.

The BOM (Byte Order Mark) is the character U+FEFF placed at the very start of a text stream. In UTF-16 it indicates the byte order: it is stored as FF FE in little-endian files and as FE FF in big-endian files. In UTF-8 the same character is encoded as the three bytes EF BB BF. The mark is optional in UTF-8, and because UTF-8 has no byte-order variants, it serves only as a signature that identifies the byte stream as UTF-8. Microsoft software checks for this signature, but some other software does not and treats the three bytes as ordinary characters.
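To make the byte values concrete, here is a small Python sketch that encodes the BOM character U+FEFF in UTF-8 and in both UTF-16 byte orders. The helper name `hex_bytes` is my own, used only for illustration.

```python
import codecs


def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex, e.g. 'EF BB BF'."""
    return " ".join(f"{b:02X}" for b in data)


# U+FEFF is the code point that serves as the byte order mark.
BOM_CHAR = "\ufeff"

bom_utf8 = BOM_CHAR.encode("utf-8")
bom_utf16_le = BOM_CHAR.encode("utf-16-le")
bom_utf16_be = BOM_CHAR.encode("utf-16-be")

print("UTF-8    :", hex_bytes(bom_utf8))
print("UTF-16 LE:", hex_bytes(bom_utf16_le))
print("UTF-16 BE:", hex_bytes(bom_utf16_be))

# The standard library exposes the same byte sequences as constants.
print(bom_utf8 == codecs.BOM_UTF8)
print(bom_utf16_le == codecs.BOM_UTF16_LE)
print(bom_utf16_be == codecs.BOM_UTF16_BE)
```

The same character produces different byte sequences in each encoding, which is why a program can use the first few bytes of a file as a hint about how the rest is encoded.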

Microsoft adds the three bytes EF BB BF to the start of the UTF-8 text files it saves. Windows programs such as Notepad use these three bytes to tell whether a text file is ANSI or UTF-8. This is mostly a Microsoft habit, though: the Unicode standard permits a BOM in UTF-8 but does not require it, and tools on other platforms usually write UTF-8 text files without one.

So a UTF-8 file may or may not have a BOM. How do you tell which? There are three ways:

1. Open the file in UltraEdit-32, switch to hex editing mode, and check whether the file begins with EF BB BF.
2. Open it in Dreamweaver, go to the page properties, and see whether "Include Unicode Signature BOM" is checked.
3. Open it in Windows Notepad, choose "Save As", and check whether the encoding selected by default is UTF-8 or ANSI. If it is ANSI, the file has no BOM.
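The hex-editor check can also be done with a few lines of code. The following is a minimal Python sketch, not part of any of the tools mentioned here. The function names `has_utf8_bom` and `demo` and the temporary file names are made up for illustration.

```python
import codecs
import os
import tempfile


def has_utf8_bom(path: str) -> bool:
    """Return True if the file starts with the UTF-8 signature EF BB BF."""
    with open(path, "rb") as f:
        return f.read(len(codecs.BOM_UTF8)) == codecs.BOM_UTF8


def demo() -> tuple:
    """Write one file with a BOM and one without, then test both."""
    with tempfile.TemporaryDirectory() as tmp:
        with_bom = os.path.join(tmp, "with_bom.php")        # hypothetical name
        without_bom = os.path.join(tmp, "without_bom.php")  # hypothetical name
        with open(with_bom, "wb") as f:
            f.write(codecs.BOM_UTF8 + b"<?php echo 'hi'; ?>")
        with open(without_bom, "wb") as f:
            f.write(b"<?php echo 'hi'; ?>")
        return has_utf8_bom(with_bom), has_utf8_bom(without_bom)


print(demo())
```

Reading the file in binary mode matters here: in text mode the decoder may consume or reinterpret the first bytes before you get to look at them.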

I looked at html_header.php in the Zen Cart template and found that the file had no BOM. I re-saved it in UltraEdit-32 with a BOM and uploaded it again, and everything was back to normal.

Note that when Convertz converts a GB2312 file to UTF-8, its default setting is to omit the BOM, so the garbled text described above may appear. If you do include a BOM, however, be careful with PHP include files: the bytes EF BB BF sit in front of the PHP code and are sent to the browser as output before the script runs, which can cause errors (for example, "headers already sent" warnings when the script later tries to set headers). One workaround is to save all included files as ANSI and keep only the main file as UTF-8. To remove the BOM from a file, open it in UltraEdit, switch to hex editing mode, replace the first three bytes (EF BB BF) with 20, save the file (turn off automatic backup before saving), then switch back to the default editing mode and delete the three leading spaces.
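As an alternative to the hex-editor procedure, the BOM can be stripped with a short script. This is a minimal Python sketch under my own assumptions, not something the editors above provide. The function names `strip_utf8_bom` and `remove_bom_in_place` are hypothetical.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM; other data is unchanged."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def remove_bom_in_place(path: str) -> bool:
    """Rewrite the file without its BOM. Return True if a BOM was removed."""
    with open(path, "rb") as f:
        data = f.read()
    stripped = strip_utf8_bom(data)
    if stripped == data:
        return False  # no BOM found, leave the file untouched
    with open(path, "wb") as f:
        f.write(stripped)
    return True


# A PHP include saved with a BOM: the three bytes precede the opening tag.
php_source = codecs.BOM_UTF8 + b"<?php header('X-Test: 1'); ?>"
print(strip_utf8_bom(php_source))

# When only the decoded text is needed, the "utf-8-sig" codec skips the BOM.
print(php_source.decode("utf-8-sig"))
```

Working on raw bytes keeps the rest of the file exactly as it was, which is safer than decoding and re-encoding a file whose encoding you are not sure about.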

I also picked up a few facts about encodings along the way. A file that Windows saves as "Unicode" is actually UTF-16, whose 16-bit units happen to equal the Unicode code point for most characters, but conceptually Unicode and UTF are different things: Unicode is the character set that assigns each character a number (a code point), while the UTFs are schemes for storing and transmitting those numbers as bytes. UTF-16 comes in two byte orders: low byte first (LE, little-endian) and high byte first (BE, big-endian). The official UTF encodings also include UTF-32, which likewise comes in LE and BE. UTF-7 is a UTF that is not part of the Unicode standard; it was designed mainly for email transmission. The single-byte part of UTF-8 is compatible with ASCII, which is also the first 128 characters of ISO-8859-1. UTF-8 came about largely because some old systems and library functions cannot handle UTF-16 correctly. For English text it also saves file space, at the expense of using more space than UTF-16 for characters such as Chinese. ASCII characters take one byte in both UTF-8 and ISO-8859-1; all other characters take two to four bytes in UTF-8.
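The size trade-offs are easy to check directly. Here is a small Python sketch that counts how many bytes a few sample characters take in different encodings and shows the two UTF-16 byte orders. The sample characters and the helper name `encoded_sizes` are my own choices for illustration.

```python
def encoded_sizes(ch: str) -> dict:
    """Return the number of bytes one character takes in each encoding."""
    return {
        name: len(ch.encode(name))
        for name in ("utf-8", "utf-16-le", "utf-32-le")
    }


samples = [
    ("A", "ASCII letter"),
    ("\u00e9", "e with acute accent, in ISO-8859-1 but not in ASCII"),
    ("\u4e2d", "Chinese character"),
    ("\U0001F600", "emoji outside the Basic Multilingual Plane"),
]

for ch, description in samples:
    # Print the code point rather than the character to avoid console issues.
    print(f"U+{ord(ch):04X} ({description}): {encoded_sizes(ch)}")

# The same code point U+4E2D in the two UTF-16 byte orders.
print("UTF-16 LE:", "\u4e2d".encode("utf-16-le").hex())  # low byte first
print("UTF-16 BE:", "\u4e2d".encode("utf-16-be").hex())  # high byte first
```

Only the ASCII letter fits in a single UTF-8 byte. The accented letter, which ISO-8859-1 stores in one byte, already needs two bytes in UTF-8, and the Chinese character needs three.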
