Introduction to MIME encoding (integrated from online information and practical experience)

1. MIME: Multipurpose Internet Mail Extensions

FOLDOC, the Free On-line Dictionary of Computing hosted at Imperial College, explains MIME as: "An encoding standard for multi-part, multimedia email and WWW hypertext, used to transmit non-text data such as graphics, sound and fax. MIME is defined in RFC 1341 and uses the MIMENCODE method to convert binary data into a combination of characters of an ASCII subset called BASE64."

There is a newsgroup on the Internet that specifically discusses MIME: comp.mail.mime. The FAQ for this newsgroup is available at the following URL:

http://www.cis.ohio-state.edu/hypertext/faq/usenet/mail/mime-faq/mime0/faq.html

MIMENCODE was originally called MMENCODE. It was proposed as a replacement for UUENCODE, because UUENCODE uses some characters that do not survive certain mail gateways (especially those that convert between ASCII and EBCDIC), and some software cannot correctly decode every UUENCODE variant, which makes messages hard to read. MIME was therefore designed to replace UUENCODE, but in practice the two coexist.

Before MIME was introduced, a message in the RFC 822 format could carry only basic ASCII text. It was very difficult to include binary files, sounds, animations, and similar content in an email.

MIME provides a way to attach multiple, differently encoded files to an email, making up for the shortcomings of the original message format. In fact, MIME is no longer just an email encoding; it has also become part of the HTTP protocol standard.

2. Introduction to MIME encoding

The original reason for encoding emails was that many gateways on the Internet could not correctly transmit 8-bit coded characters, such as Chinese characters. The principle of encoding is to convert 8-bit content into 7-bit form so that it can be transmitted correctly, and then restore it to 8-bit content after the receiver receives it.

Before MIME, email was encoded with UUENCODE and other methods. However, because the MIME algorithms are simple and easy to extend, MIME has become the mainstream email encoding. It is used not only to transmit 8-bit characters but also to transmit binary files, such as images and audio in email attachments, and many MIME-based applications have grown out of it. In terms of encoding, MIME defines two methods: Base64 and QP (Quoted-Printable).

1. Base64 encoding

Base64 is a general-purpose method whose principle is very simple: three bytes of data are represented by four bytes. In each of these four bytes only 6 bits carry data, so the restriction to 7-bit characters is not a problem. Base64 is generally abbreviated as "B".

Base64 encodes the input string or a piece of data into a string containing only 64 characters {'A'-'Z', 'a'-'z', '0'-'9', '+', '/'}, and '=' is used for padding.

The encoding method is to take 6 bits of the input data stream each time, use the value of this 6 bit (0-63) as an index to look up the table, and output the corresponding character.

In this way, every 3 bytes are encoded as 4 characters (3×8 → 4×6); if the final group yields fewer than 4 characters, it is padded with '='.

In header fields, the form "=?charset?B?xxxxxxxx?=" is used to indicate that xxxxxxxx is Base64 encoded and that the character set of the original text is charset. In a part body, the data is encoded directly and line breaks are inserted at suitable points; MIME limits each encoded line to at most 76 characters.
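The two uses described above can be sketched in Python as follows. This is a minimal illustration, not a complete header encoder: `encoded_word_b` is a hypothetical helper name, and UTF-8 is an assumed default charset. `base64.encodebytes` from the standard library is used to show the 76-character line limit.

```python
import base64

def encoded_word_b(text: str, charset: str = "utf-8") -> str:
    """Wrap text as a Base64 ("B") encoded-word for a header field.

    Hypothetical helper: it ignores the length limit that applies to
    a single encoded-word in a real header.
    """
    payload = base64.b64encode(text.encode(charset)).decode("ascii")
    return f"=?{charset}?B?{payload}?="

print(encoded_word_b("你好"))  # =?utf-8?B?5L2g5aW9?=

# In a body, encodebytes() breaks the output into lines of at most
# 76 characters.
body = base64.encodebytes(bytes(range(256))).decode("ascii")
lines = body.splitlines()
print(max(len(line) for line in lines))  # 76
```

Each full 76-character line carries 57 input bytes, which is why 57 is a common chunk size when encoding large attachments.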

The Base64 algorithm is very simple. It reads the byte stream sequentially into a 24-bit buffer, filling any missing bytes with zeros.

The buffer is then split into four 6-bit groups, most significant bits first, and each group is represented by one of the 64 characters. If the final input group has only one or two bytes, the output is padded with the equal sign "=", so the decoder can tell how many bytes the last group really held.

How to do base64 encoding
Base64 uses a 65-character subset of US-ASCII: 64 characters that each represent 6 bits, plus "=" for padding.
For a text string, the encoding process is as follows. Take "men" as an example.
First convert each character to its US-ASCII value.

"m" decimal 109
"e" decimal 101
"n" decimal 110
Binary:
m 01101101
e 01100101
n 01101110

The three 8-bit values joined together give 24 bits:
011011010110010101101110

Then divide them into four 6-bit groups:
011011 010110 010101 101110

This gives four values, which in decimal are:
27 22 21 46

The corresponding Base64 characters are: b W V u
So "bWVu" is the Base64 value of "men". Encoding always works on groups of 3 input characters, each of which produces 4 Base64 characters.

If the final group has only 2 characters of data, the special character "=" is used to complete the 4 Base64 characters.
For example, encoding "me":
01101101 01100101
0110110101100101
011011 010110 0101
The last group has only 4 bits, so it is filled with two zero bits to make 6:
011011 010110 010100
27 22 20
b W U = ("=" to make up 4 characters)
So "bWU=" is the Base64 value of "me".

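If the final group has only 1 character of data, such as "m":
01101101
011011 01
The last group has only 2 bits, so it is filled with four zero bits to make 6:
011011 010000
27 16
b Q = = (two "=" to make up 4 characters)
So "bQ==" is the Base64 value of "m".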
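The worked examples above can be reproduced with a short Python sketch of the 24-bit buffer method. `b64_encode` and `ALPHABET` are hypothetical names chosen for this illustration; the result is checked against the standard library's `base64.b64encode`.

```python
import base64

ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)

def b64_encode(data: bytes) -> str:
    out = []
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        # Load up to 3 bytes into a 24-bit buffer, zero-filling missing bytes.
        buf = int.from_bytes(chunk.ljust(3, b"\0"), "big")
        # Cut the buffer into four 6-bit indices, high bits first.
        chars = [ALPHABET[(buf >> shift) & 0x3F] for shift in (18, 12, 6, 0)]
        # 1 input byte keeps 2 characters, 2 bytes keep 3, 3 bytes keep 4;
        # the rest of the group is replaced by "=" padding.
        keep = len(chunk) + 1
        out.append("".join(chars[:keep]) + "=" * (4 - keep))
    return "".join(out)

for sample in (b"men", b"me", b"m"):
    assert b64_encode(sample) == base64.b64encode(sample).decode("ascii")
    print(sample.decode("ascii"), "->", b64_encode(sample))
```

In real code the standard `base64` module is the better choice; the hand-written version is only meant to mirror the steps in the text.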

2. QP encoding

The other method is QP (Quoted-Printable), usually abbreviated as the "Q" method. Its principle is to represent an 8-bit character by its two-digit hexadecimal value with "=" in front. A file encoded with QP therefore usually looks like this: =B3=C2=BF=A1=C7=E5=A3=AC=C4=FA=BA=C3=A3=A1.

Quoted-Printable encodes an input string or byte range. Characters that do not need encoding are output unchanged. A byte that does need encoding is output as '=' followed by its value as two hexadecimal digits. In header fields, the form "=?charset?Q?xxxxxxxx?=" is used to indicate that xxxxxxxx is Quoted-Printable encoded and that the character set of the original text is charset. In a part body, the data is encoded directly, lines are wrapped at suitable points, and an extra '=' is output before each inserted line break.
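A simplified sketch of this rule in Python is shown below. `qp_encode` is a hypothetical helper that skips line wrapping and the special handling of trailing whitespace that a complete encoder needs, and the sample text is assumed to be GB2312. The standard library's `quopri` module is used to check that the result decodes back to the original bytes.

```python
import quopri

def qp_encode(data: bytes) -> str:
    """Simplified quoted-printable encoder (hypothetical helper).

    It does not wrap long lines or treat trailing whitespace and line
    breaks the way a complete encoder must.
    """
    out = []
    for byte in data:
        printable = 33 <= byte <= 126 and byte != ord("=")
        if printable or byte == ord(" "):
            out.append(chr(byte))        # output unchanged
        else:
            out.append(f"={byte:02X}")   # "=" plus two hex digits
    return "".join(out)

raw = "你好, MIME".encode("gb2312")
encoded = qp_encode(raw)
print(encoded)  # =C4=E3=BA=C3, MIME

# The standard library decoder recovers the original bytes.
assert quopri.decodestring(encoded.encode("ascii")) == raw
```

Because ASCII characters pass through unchanged, Quoted-Printable stays readable for mostly-ASCII text, while Base64 is more compact for data that is mostly 8-bit.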

3. MIME header information

Mail Header

In the email header, many field names are inherited from RFC 822, and MIME adds some of its own. Common standard field names and their meanings are as follows:

Field name                 Meaning                            Added by
Received                   Transmission path                  Mail servers at each hop
Return-Path                Return address                     Destination mail server
Delivered-To               Delivery address                   Destination mail server
Reply-To                   Reply address                      Creator of the email
From                       Sender address                     Creator of the email
To                         Recipient address                  Creator of the email
Cc                         Carbon-copy address                Creator of the email
Bcc                        Blind-carbon-copy address          Creator of the email
Date                       Date and time                      Creator of the email
Subject                    Subject                            Creator of the email
Message-ID                 Message ID                         Creator of the email
MIME-Version               MIME version                       Creator of the email
Content-Type               Type of the content                Creator of the email
Content-Transfer-Encoding  Transfer encoding of the content   Creator of the email

Non-standard, custom field names all start with X-, such as X-Mailer and X-MSMail-Priority. Their meaning is usually understood only when the sending and receiving programs are the same.

Part Header

A part header contains roughly the following fields:

Field name                 Meaning
Content-Type               Type of the part body
Content-Transfer-Encoding  Transfer encoding of the part body
Content-Disposition        How the part body is to be presented
Content-ID                 ID of the part body
Content-Location           Location (path) of the part body
Content-Base               Base location of the part body

Some fields have parameters in addition to a value. The value is separated from the first parameter, and each parameter from the next, by ";". A parameter name and its value are separated by "=".
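As a small illustration of the value-and-parameter syntax, the sketch below parses a hand-written header with Python's standard `email` package. The header values (`gb2312`, `format=flowed`) are sample data made up for this example.

```python
from email import message_from_string

raw = (
    "MIME-Version: 1.0\n"
    'Content-Type: text/plain; charset="gb2312"; format=flowed\n'
    "Content-Transfer-Encoding: 7bit\n"
    "\n"
    "hello\n"
)
msg = message_from_string(raw)

# The value comes first; parameters follow, separated by ";".
print(msg.get_content_type())     # text/plain
print(msg.get_param("charset"))   # gb2312
print(msg.get_param("format"))    # flowed
```

`get_param` removes the quotes around a parameter value, so quoted and unquoted forms of the same parameter read the same way.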

1. MIME-Version

Indicates the version number of MIME in use, usually 1.0.

For example:

MIME-Version: 1.0

2. Content-Type

Content-Type defines the type of the body; it is this field that tells us what kind of data the body contains. For example, text/plain is unformatted text, text/html is an HTML document, image/gif is a GIF image, and so on. Content-Type takes the form "main type/subtype". The main types are text, image, audio, video, application, multipart, message, and so on, standing for text, images, audio, video, application data, multi-part content, and messages respectively. Each main type can have many subtypes; for example, the text type includes the plain, html, xml, and css subtypes. Main types and subtypes that begin with x- are custom types that are not officially registered with IANA, although many of them are in common use; application/x-zip-compressed, for example, is the ZIP file type. On Windows, most known Content-Types other than multipart are listed in the registry under "HKEY_CLASSES_ROOT\MIME\Database\Content Type".

The RFCs contain many additional rules about parameters, and some types allow several parameters. The more common ones are:

Main type    Parameter  Meaning
text         charset    Character set
image        name       Name
application  name       Name
multipart    boundary   Boundary

multipart type

The composite type most commonly used in email is multipart.

The multipart type indicates that the body consists of multiple parts; the subtype describes the relationship between those parts.

Three subtypes are used in email:

(1) multipart/alternative: Indicates that the body consists of two parts that are alternative versions of the same content, of which one is chosen for display. Its main use is for a message body that exists in both plain-text and HTML form: mail clients that support HTML generally display the HTML version, while those that do not display the plain-text version.

(2) multipart/mixed: Indicates that the parts are independent and mixed together, which is the relationship between the main body and its attachments. If the MIME type of an email is multipart/mixed, the email generally has attachments.

(3) multipart/related: Indicates that the parts are related to one another. It is generally used for an HTML body and the images it refers to.

The multipart type is the essence of MIME email. The email body is divided into multiple parts; each part consists of a part header and a part body, which are separated by a blank line. The hierarchical relationship among the three multipart subtypes can be summarized in the following figure:

+-------------------------------------- multipart/mixed ---------------------------------------+
|                                                                                              |
|  +-------------------------- multipart/related ---------------------------+                  |
|  |                                                                        |                  |
|  |  +---------- multipart/alternative ----------+  +-------------------+  |  +------------+  |
|  |  |                                           |  | Embedded resource |  |  | Attachment |  |
|  |  |  +-----------------+  +----------------+  |  +-------------------+  |  +------------+  |
|  |  |  | Plain-text body |  | Hypertext body |  |                         |                  |
|  |  |  +-----------------+  +----------------+  |  +-------------------+  |  +------------+  |
|  |  |                                           |  | Embedded resource |  |  | Attachment |  |
|  |  +-------------------------------------------+  +-------------------+  |  +------------+  |
|  |                                                                        |                  |
|  +------------------------------------------------------------------------+                  |
|                                                                                              |
+----------------------------------------------------------------------------------------------+

It can be seen that to add attachments to an email you must define a multipart/mixed part; if there are embedded resources, you must define at least a multipart/related part; if plain text and hypertext coexist, you must define at least a multipart/alternative part. "At least" means that a broader type is also allowed: for example, if a message contains only a plain-text body and a hypertext body, its header may still declare the type as multipart/related or even multipart/mixed.

The common feature of the multipart types is that a "boundary" parameter string is specified in the part header, and each sub-part in the part body is delimited by this string. Every sub-part starts with a "--" + boundary line, and the parent part ends with a "--" + boundary + "--" line. Parts are also separated from one another by blank lines. In a multipart body, there may be some extra lines of text before the first "--" + boundary line; these act as comments and should be ignored when decoding. There may also be extra lines of text after the closing "--" + boundary + "--" line, which are likewise not displayed.
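The boundary rules can be tried out with Python's standard `email` parser. The message below is a hand-written sample; the boundary string "XYZ" and the body texts are made up for this sketch.

```python
from email import message_from_string

raw = "\n".join([
    "MIME-Version: 1.0",
    'Content-Type: multipart/alternative; boundary="XYZ"',
    "",
    "This is a multi-part message in MIME format.",  # text before the first boundary
    "",
    "--XYZ",                       # start of the first sub-part
    "Content-Type: text/plain",
    "",                            # blank line between part header and part body
    "plain body",
    "--XYZ",                       # start of the second sub-part
    "Content-Type: text/html",
    "",
    "<p>html body</p>",
    "--XYZ--",                     # end of the parent part
    "",
])

msg = message_from_string(raw)
print(msg.get_boundary())          # XYZ
print(repr(msg.preamble))          # the comment-like text, kept apart from the parts
for part in msg.get_payload():
    print(part.get_content_type(), repr(part.get_payload()))
```

The parser keeps the text before the first boundary in `msg.preamble` rather than treating it as a part, which matches the rule that it is ignored for display.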

These composite types can be nested. For example, if an email has an attachment and a body in both HTML and text formats, the structure of the email is:

Content-Type: multipart/mixed

    Part 1:

        Content-Type: multipart/alternative

            Plain-text body

            HTML body

    Part 2:

        Attachment

    End of message

Because a composite type consists of multiple parts, a delimiter is needed to separate them; this is what the boundary parameter in the email source describes. Every body whose Content-Type is multipart/* carries such a parameter to mark the separation between its parts.

When you view the source of a MIME/Base64-encoded email, it generally contains a line such as "This is a multi-part message in MIME format." Such messages can be decoded by most email programs, including Netscape, MS Mail, and Eudora, which correctly identify the body of the email and restore the MIME/Base64-encoded parts to the original text or binary attachments.
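A nested structure of this kind can be assembled with the `email.mime` classes from Python's standard library. This is only a sketch: the subject, body texts, and the attachment name `data.bin` are placeholders invented for the example.

```python
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Inner part: the same body in plain-text and HTML form.
alt = MIMEMultipart("alternative")
alt.attach(MIMEText("Hello", "plain", "utf-8"))
alt.attach(MIMEText("<p>Hello</p>", "html", "utf-8"))

# Attachment part (placeholder bytes and file name).
attachment = MIMEApplication(b"\x00\x01binary data", Name="data.bin")
attachment.add_header("Content-Disposition", "attachment", filename="data.bin")

# Outer part: body plus attachment.
mixed = MIMEMultipart("mixed")
mixed["Subject"] = "Nested multipart example"
mixed.attach(alt)
mixed.attach(attachment)

source = mixed.as_string()   # boundaries are generated during serialization

for part in mixed.walk():
    print(part.get_content_type())

# Each multipart level gets its own boundary string.
print(mixed.get_boundary() != alt.get_boundary())
```

Printing `source` shows the outer boundary enclosing the inner multipart/alternative part, which in turn uses a different boundary for its two bodies.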

3. Content-Transfer-Encoding

It indicates how this part of the message is encoded. Only by reading this field can the receiver choose the correct decoding method.

Content-Transfer-Encoding can take several values, including base64, quoted-printable, 7bit, 8bit, and binary.

Among them, 7bit is the default. Email source was originally designed to consist entirely of printable ASCII characters.

Non-ASCII text or data must be encoded into the required format.

Base64 and Quoted-Printable are the most widely used encodings in non-English-speaking countries.

The binary value is only nominal and has little practical use in email.
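The following sketch shows a receiver choosing the decoder from this field, using Python's standard `email` package. The two-part message is hand-written sample data; both parts are assumed to carry the same UTF-8 text, once as base64 and once as quoted-printable.

```python
from email import message_from_string

raw = "\n".join([
    "MIME-Version: 1.0",
    'Content-Type: multipart/mixed; boundary="sep"',
    "",
    "--sep",
    'Content-Type: text/plain; charset="utf-8"',
    "Content-Transfer-Encoding: base64",
    "",
    "5L2g5aW9",
    "--sep",
    'Content-Type: text/plain; charset="utf-8"',
    "Content-Transfer-Encoding: quoted-printable",
    "",
    "=E4=BD=A0=E5=A5=BD",
    "--sep--",
    "",
])

msg = message_from_string(raw)
decoded = []
for part in msg.get_payload():
    cte = part["Content-Transfer-Encoding"]
    # decode=True undoes base64 or quoted-printable according to the header.
    data = part.get_payload(decode=True)
    decoded.append((cte, data.decode(part.get_content_charset())))

print(decoded)
```

Both parts decode to the same text, which shows that the transfer encoding changes only how the bytes travel, not what they mean.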

4. boundary

The boundary is an arbitrary string of characters that must not appear anywhere in the body. Within the message, "--" plus the boundary marks the beginning of a part. At the end, "--" plus the boundary followed by "--" marks the end of the multipart body. Because composite types can be nested, an email may contain several boundaries.
