Special characters in URLs: character - URL-encoded value
URL special character escaping: some characters in a URL have special meanings. The basic rules are as follows:
1. A space is replaced with a plus sign (+).
2. A forward slash (/) separates directories and subdirectories.
3. A question mark (?) separates the URL from the query string.
4. A percent sign (%) introduces an escaped special character.
5. A hash sign (#) marks a bookmark (fragment).
6. An ampersand (&) separates parameters.
If you need to use one of these characters literally in a URL, you must replace it with its corresponding hexadecimal escape value.
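As a small illustration (plain JavaScript; the parameter value below is made up for this example), encodeURIComponent replaces each of these special characters with its hexadecimal escape:

// A value that contains several of the special characters listed above
var rawValue = "price=100&size?#large value";

// Every special character is replaced by % followed by its hex code;
// note that a space becomes %20 here (the + form is used by form submission)
console.log(encodeURIComponent(rawValue));
// "price%3D100%26size%3F%23large%20value"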
This article introduces URI encoding and decoding, explains in detail which characters need to be encoded in URL encoding and why, and compares the three pairs of JavaScript functions related to encoding and decoding: escape/unescape, encodeURI/decodeURI, and encodeURIComponent/decodeURIComponent.

Prerequisites
URI stands for Uniform Resource Identifier. What we usually call a URL is just one type of URI; a typical URL has the form scheme://host:port/path?query#fragment. The URL encoding discussed below really refers to URI encoding.

Why do we need URL encoding?

Usually, if something needs to be encoded, it is because it is not suitable for transmission as-is. There are many possible reasons: the data is too large, it contains private information, or, in the case of a URL, some characters would cause ambiguity. For example, the URL parameter string uses key=value pairs to pass parameters, and the pairs are separated by the & symbol, as in /s?q=abc&ie=utf-8. If a value itself contains = or &, the server receiving the URL will inevitably parse it incorrectly, so the ambiguous & and = symbols must be escaped, that is, encoded. Another reason is that URLs are encoded in ASCII, not Unicode, which means a URL cannot directly contain any non-ASCII characters such as Chinese; otherwise, if the client browser and the server support different character sets, those characters may cause problems. The principle of URL encoding is to use safe characters (printable characters with no special purpose or special meaning) to represent unsafe characters.

Which characters need to be encoded?

RFC 3986 stipulates that a URL may only contain English letters (a-z, A-Z), digits (0-9), the four special characters - _ . ~, and the reserved characters (used for their reserved purposes). RFC 3986 gives detailed recommendations on URL encoding and decoding, indicating which characters must be encoded so that the semantics of the URL do not change, and explains why these characters need to be encoded.

No corresponding printable character in the US-ASCII character set

Only printable characters are allowed in URLs. In US-ASCII, the bytes 0x00-0x1F and 0x7F represent control characters, which cannot appear directly in a URL. Likewise, the bytes 0x80-0xFF (ISO-8859-1) cannot be placed in a URL because they are outside the range defined by US-ASCII.

Reserved characters

A URL can be divided into several components, such as protocol, host, and path. Some characters (: / ? # [ ] @) are used to separate different components; for example, the colon separates the protocol from the host, / separates the host from the path, and ? separates the path from the query parameters. Other characters (! $ & ' ( ) * + , ; =) are used to delimit sub-components within a component; for example, = represents the key-value pairs in the query parameters, and & separates multiple key-value pairs in the query. When ordinary data inside a component contains these special characters, it must be encoded. RFC 3986 specifies the following characters as reserved characters: : / ? # [ ] @ ! $ & ' ( ) * + , ; =
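To make the ambiguity described above concrete, here is a minimal JavaScript sketch (the parameter names are invented for this example) showing how an unencoded = or & inside a value corrupts the query string, and how encoding the value avoids it:

// The value itself happens to contain the reserved characters = and &
var q = "a=b&c";

// Naive concatenation: the receiving server can no longer tell where q ends
var badUrl = "/s?q=" + q + "&ie=utf-8";
// "/s?q=a=b&c&ie=utf-8"

// Encoding the value keeps the separators unambiguous
var goodUrl = "/s?q=" + encodeURIComponent(q) + "&ie=utf-8";
// "/s?q=a%3Db%26c&ie=utf-8"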
Unsafe characters

There are also some characters that, when placed directly in a URL, may cause ambiguity for parsers. These characters are considered unsafe for a number of reasons.
It should be noted that for legal characters in a URL, encoding and not encoding are equivalent, but for the characters mentioned above, leaving them unencoded may change the semantics of the URL. Therefore, only ordinary English letters and digits, the special characters $ - _ . + ! * ' ( ), and reserved characters may appear unencoded in a URL; all other characters must be encoded before they can appear in a URL. However, for historical reasons there are still some non-standard encoding implementations. For example, although RFC 3986 stipulates that the tilde (~) does not need to be URL-encoded, many old gateways and transmission agents encode it anyway.

How to encode illegal characters in a URL

URL encoding is usually also called percent-encoding, because its encoding method is very simple: a percent sign followed by two hexadecimal characters (0-9, A-F) represents one byte. The default character set used for URL encoding is US-ASCII. For example, the byte corresponding to 'a' in US-ASCII is 0x61, so after URL encoding it becomes %61; entering http://g.cn/search?q=%61%62%63 in the address bar is actually equivalent to searching for abc on Google. Similarly, the byte corresponding to the @ symbol in ASCII is 0x40, which becomes %40 after URL encoding.

Some common characters and their URL encodings:
space - %20
" - %22
# - %23
% - %25
& - %26
+ - %2B
/ - %2F
: - %3A
= - %3D
? - %3F
@ - %40
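The percent-encoding rule itself can be reproduced in a few lines of JavaScript. This is only an illustrative sketch for single-byte ASCII characters (the function name is my own), not a replacement for the built-in functions:

// Percent-encode one ASCII character from its byte value
function percentEncodeAscii(ch) {
  var hex = ch.charCodeAt(0).toString(16).toUpperCase();
  return "%" + (hex.length < 2 ? "0" + hex : hex);   // pad to two hex digits
}

console.log(percentEncodeAscii("a"));                // "%61"
console.log(percentEncodeAscii("@"));                // "%40"
console.log(decodeURIComponent("%61%62%63"));        // "abc"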
For non-ASCII characters, a superset of the ASCII character set must be used to encode the character into bytes, and each byte is then percent-encoded. For Unicode characters, the RFC recommends using UTF-8 to obtain the bytes and then percent-encoding each byte. For example, the UTF-8 bytes of "中文" ("Chinese") are 0xE4 0xB8 0xAD 0xE6 0x96 0x87, so after URL encoding we get "%E4%B8%AD%E6%96%87". If a byte corresponds to an unreserved character in ASCII, it does not need to be represented with a percent sign. For example, "Url编码" ("URL encoding") encoded with UTF-8 gives the bytes 0x55 0x72 0x6C 0xE7 0xBC 0x96 0xE7 0xA0 0x81; since the first three bytes correspond to the unreserved characters "Url" in ASCII, they can be written as those characters themselves, and the final URL encoding can be simplified to "Url%E7%BC%96%E7%A0%81". Of course, "%55%72%6C%E7%BC%96%E7%A0%81" is also acceptable. For historical reasons, some URL-encoding implementations do not fully follow this principle, as mentioned below.

Differences among escape, encodeURI and encodeURIComponent in JavaScript

JavaScript provides three pairs of functions for encoding a URL to obtain a legal URL: escape/unescape, encodeURI/decodeURI, and encodeURIComponent/decodeURIComponent. Since decoding is simply the reverse of encoding, only the encoding functions are discussed here. All three encoding functions - escape, encodeURI, encodeURIComponent - convert unsafe or illegal URL characters into legal URL character representations. They differ in the following ways.

The safe characters are different

Each function leaves a different set of characters unencoded (its "safe" characters):
escape (69 safe characters): letters, digits, and * / @ + - . _
encodeURI (82 safe characters): letters, digits, and ! # $ & ' ( ) * + , - . / : ; = ? @ _ ~
encodeURIComponent (71 safe characters): letters, digits, and ! ' ( ) * - . _ ~
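A quick way to see the different safe-character sets is to run the same string through all three functions in a browser console (escape is deprecated and is shown here only for comparison):

var uri = "http://example.com/a b/?q=中文&x=1";

console.log(escape(uri));
// "http%3A//example.com/a%20b/%3Fq%3D%u4E2D%u6587%26x%3D1"

console.log(encodeURI(uri));
// "http://example.com/a%20b/?q=%E4%B8%AD%E6%96%87&x=1"

console.log(encodeURIComponent(uri));
// "http%3A%2F%2Fexample.com%2Fa%20b%2F%3Fq%3D%E4%B8%AD%E6%96%87%26x%3D1"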
Different compatibility

The escape function has existed since JavaScript 1.0, while the other two functions were introduced in JavaScript 1.5. Since JavaScript 1.5 is already ubiquitous, there is effectively no compatibility issue in using encodeURI and encodeURIComponent.

Different encoding of Unicode characters

All three functions encode ASCII characters the same way: a percent sign followed by two hexadecimal characters. For other Unicode characters, however, escape produces %uxxxx, where xxxx is a 4-digit hexadecimal number representing the Unicode character. This format has been deprecated by the W3C, although the escape syntax is still retained in the ECMA-262 standard. encodeURI and encodeURIComponent encode non-ASCII characters with UTF-8 and then percent-encode the resulting bytes, which is what the RFC recommends. For this reason, prefer these two functions over escape whenever possible.

Applicable to different occasions

encodeURI is used to encode a complete URI, while encodeURIComponent is used to encode a single URI component. As the safe-character table above shows, encodeURIComponent encodes a larger range of characters than encodeURI. As mentioned earlier, reserved characters are used to separate URI components (a URI can be split into multiple components; see the prerequisites section) or sub-components (such as the separators of query parameters): for example, the : character separates the scheme from the host, and the ? character separates the path from the query. Because encodeURI operates on a complete URI, in which these characters have special purposes, it does not encode the reserved characters; otherwise the meaning of the URI would change. A single component, on the other hand, has its own data format, and that data must not contain the reserved characters that separate components, or the structure of the whole URI would be broken. Therefore, encodeURIComponent, which operates on a single component, needs to encode more characters.

Form submission

When an HTML form is submitted, each form field is URL-encoded before being sent. For historical reasons, the URL encoding used by forms does not conform to the latest standard; for example, a space is encoded not as %20 but as a + sign. If the form is submitted with the POST method, the HTTP request carries a Content-Type header with the value application/x-www-form-urlencoded. Most applications can handle this non-standard URL encoding, but in client-side JavaScript there is no built-in function that decodes the + sign back into a space, so you have to write your own conversion (see the sketch below). Also, for non-ASCII characters, the character set used for encoding depends on the character set of the current document. For example, if the HTML header contains <meta http-equiv="Content-Type" content="text/html; charset=gb2312" />, the browser will render the document as gb2312 (note that when this meta tag is absent, the browser chooses a character set based on the current user's preferences, and the user can also force the site to use a specified character set), and when the form is submitted the character set used for URL encoding is gb2312.
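Since decodeURIComponent does not turn + back into a space, a small helper like the sketch below (the function name is my own) is a common way to decode application/x-www-form-urlencoded values on the client:

// Decode a value taken from an application/x-www-form-urlencoded body
function decodeFormValue(value) {
  // The form encoding uses "+" for spaces; restore them before percent-decoding
  return decodeURIComponent(value.replace(/\+/g, " "));
}

console.log(decodeFormValue("hello+world%21"));   // "hello world!"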
Does the document character set affect encodeURI?

I ran into a very confusing problem when using Aptana (why I specifically mention Aptana will become clear below). When I used encodeURI, the result was very different from what I expected. Here is my sample code:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />
</head>
<body>
<script type="text/javascript">
document.write(encodeURI("中文"));
</script>
</body>
</html>

The output is %E6%B6%93%EE%85%9F%E6%9E%83. This is clearly not the result of URL-encoding with the UTF-8 character set (search for "中文" on Google, and the URL shows %E4%B8%AD%E6%96%87). So at the time I suspected that encodeURI might also depend on the page encoding, but I found that under normal circumstances, even URL-encoding with gb2312 would not produce this result. Eventually I discovered that the problem was caused by a mismatch between the character set used to save the page file and the character set declared in the meta tag. Aptana's editor uses the UTF-8 character set by default, so the file was actually stored as UTF-8; but because the meta tag declares gb2312, the browser parses the document as gb2312. Naturally, the string "中文" is mangled: its UTF-8 bytes are 0xE4 0xB8 0xAD 0xE6 0x96 0x87, and when the browser decodes these 6 bytes as gb2312 it gets three other characters (displayed roughly as "涓枃", the middle one being unprintable, since a Chinese character occupies two bytes in GBK). Passing those three characters to encodeURI yields %E6%B6%93%EE%85%9F%E6%9E%83. So encodeURI always uses UTF-8 and is not affected by the page character set.

Other issues related to URL encoding

Different browsers behave differently when handling URLs that contain Chinese characters. For example, in IE, if the advanced setting "Always send URLs as UTF-8" is checked, the Chinese characters in the path part of the URL are encoded as UTF-8 and sent to the server, while the Chinese characters in the query parameters are encoded with the system default character set. To ensure maximum interoperability, it is recommended to explicitly URL-encode every component placed in a URL with a specified character set, rather than relying on the browser's default behavior. In addition, many HTTP monitoring tools and browser address bars automatically decode URLs (using the UTF-8 character set) when displaying them. This is why, when you search for Chinese on Google in Firefox, the URL shown in the address bar contains Chinese characters; the actual URL sent to the server is still encoded, as you can see by reading location.href from JavaScript. Don't be misled by these illusions when studying URL encoding and decoding.
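As a quick check of the last two points, the following console sketch can be run in any modern browser; it shows that encodeURI always works from the string's UTF-8 bytes, and that the address actually sent is the encoded one even when the address bar displays it decoded:

// encodeURI is independent of the document character set
console.log(encodeURI("中文"));                 // "%E4%B8%AD%E6%96%87"

// The real URL of the current page is still percent-encoded...
console.log(location.href);
// ...even though the address bar may show the decoded (UTF-8) form
console.log(decodeURI(location.href));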