A group of friends ran into a program error because the character * was not URL-encoded the way they expected, so this article tests how different programming languages implement URL encoding.
Related Standards
Since RFC 1738: Uniform Resource Locators (URL) is not an Internet Standard, this article refers to the Internet Standard RFC 3986: Uniform Resource Identifier (URI): Generic Syntax. This standard recommends using the general term "URI" instead of the more restrictive terms "URL" and "URN" (RFC 3305).
RFC 3986 defines unreserved characters in a URI as follows:
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
When producing a URI, unreserved characters should be left unescaped. The standard adds that a URI in which these characters have been escaped is still equivalent, and that the escaped octets should be decoded back to the original characters during normalization:
URIs that differ in the replacement of an unreserved character with
its corresponding percent-encoded US-ASCII octet are equivalent: they
identify the same resource. However, URI comparison implementations
do not always perform normalization prior to comparison (see Section
6). For consistency, percent-encoded octets in the ranges of ALPHA
(%41-%5A and %61-%7A), DIGIT (%30-%39), hyphen (%2D), period (%2E),
underscore (%5F), or tilde (%7E) should not be created by URI
producers and, when found in a URI, should be decoded to their
corresponding unreserved characters by URI normalizers.
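The normalization described in the quoted passage can be sketched in Python as follows. This is only an illustrative sketch: the names `UNRESERVED`, `PCT_OCTET`, and `normalize_unreserved` are made up for this example and do not come from any library.

```python
import re

# unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"  (RFC 3986)
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

# One percent-encoded octet, e.g. %7E or %2d
PCT_OCTET = re.compile(r"%([0-9A-Fa-f]{2})")


def normalize_unreserved(uri: str) -> str:
    """Decode percent-encoded octets that stand for unreserved characters.

    Every other percent-encoded octet is left exactly as it was.
    """
    def replace(match):
        char = chr(int(match.group(1), 16))
        return char if char in UNRESERVED else match.group(0)

    return PCT_OCTET.sub(replace, uri)


print(normalize_unreserved("/%7Euser/a%2Db%2Fc"))  # /~user/a-b%2Fc
```

Note that `%2F` stays escaped: `/` is a reserved character, so decoding it could change how the URI is interpreted.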
The standard also points out that the ~ character is often escaped as %7E in older URI encoding implementations.
For example, the octet
corresponding to the tilde ("~") character is often encoded as "%7E"
by older URI processing implementations; the "%7E" can be replaced by
"~" without changing its interpretation.
For reserved characters that may need to be escaped, the standard divides them into two categories:
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
Among these, gen-delims delimit the generic components of a URI and generally must be escaped when they appear as data, while whether sub-delims need to be escaped depends on where they appear. In addition, because escaping itself uses the % symbol, a literal % must also be escaped (as %25).
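As a small illustration of these rules, Python's `urllib.parse.quote` lets the caller choose which reserved characters stay literal through its `safe` parameter. The constant `SUB_DELIMS` and the sample strings are assumptions made for this sketch.

```python
from urllib.parse import quote

SUB_DELIMS = "!$&'()*+,;="

# safe="" escapes every reserved character, including "/",
# which quote() would otherwise leave alone by default.
print(quote("a/b?c=d&e", safe=""))      # a%2Fb%3Fc%3Dd%26e

# The percent sign itself is escaped as %25.
print(quote("100%", safe=""))           # 100%25

# Listing sub-delims as safe keeps them literal; "/" is still escaped.
print(quote("a*b/c", safe=SUB_DELIMS))  # a*b%2Fc
```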
A typical URI consists of the following components:
    foo://example.com:8042/over/there?name=ferret#nose
    \_/   \______________/\_________/ \_________/ \__/
     |           |            |            |        |
  scheme     authority       path        query   fragment
     |   _____________________|__
    / \ /                        \
    urn:example:animal:ferret:nose
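The same split can be reproduced with Python's `urllib.parse.urlsplit`, shown here only as a quick illustration. Python calls the authority component `netloc`.

```python
from urllib.parse import urlsplit

parts = urlsplit("foo://example.com:8042/over/there?name=ferret#nose")
print(parts.scheme)    # foo
print(parts.netloc)    # example.com:8042  (the authority)
print(parts.path)      # /over/there
print(parts.query)     # name=ferret
print(parts.fragment)  # nose

# A URN has a scheme and a path, but no authority.
urn = urlsplit("urn:example:animal:ferret:nose")
print(urn.scheme, urn.path)  # urn example:animal:ferret:nose
```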
The grammar fragments related to sub-delims are as follows:
authority = [ userinfo "@" ] host [ ":" port ]
userinfo = *( unreserved / pct-encoded / sub-delims / ":" )
host = IP-literal / IPv4address / reg-name
IP-literal = "[" ( IPv6address / IPvFuture ) "]"
IPvFuture = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )
reg-name = *( unreserved / pct-encoded / sub-delims )
path = path-abempty ; begins with "/" or is empty
/ path-absolute ; begins with "/" but not "//"
/ path-noscheme ; begins with a non-colon segment
/ path-rootless ; begins with a segment
/ path-empty ; zero characters
path-abempty = *( "/" segment )
path-absolute = "/" [ segment-nz *( "/" segment ) ]
path-noscheme = segment-nz-nc *( "/" segment )
path-rootless = segment-nz *( "/" segment )
path-empty = 0<pchar>
segment = *pchar
segment-nz = 1*pchar
segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
; non-zero-length segment without any colon ":"
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
query = *( pchar / "/" / "?" )
fragment = *( pchar / "/" / "?" )
According to the above grammar, characters in sub-delims may remain unescaped in authority, path, query, and fragment.
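To make the query rule concrete, the sketch below turns `query = *( pchar / "/" / "?" )` into a regular expression. The names `QUERY` and `is_valid_query` are hypothetical, and the check only covers the grammar, not what the characters mean to any particular application.

```python
import re

UNRESERVED = r"A-Za-z0-9\-._~"
SUB_DELIMS = r"!$&'()*+,;="
PCT_ENCODED = r"%[0-9A-Fa-f]{2}"

# query = *( pchar / "/" / "?" )
# pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
QUERY = re.compile(rf"(?:[{UNRESERVED}{SUB_DELIMS}:@/?]|{PCT_ENCODED})*")


def is_valid_query(text: str) -> bool:
    """Return True if text matches the RFC 3986 query production."""
    return QUERY.fullmatch(text) is not None


print(is_valid_query("name=ferret"))  # True
print(is_valid_query("a*b(c)!"))      # True: sub-delims may appear literally
print(is_valid_query("a b"))          # False: a space must be escaped
print(is_valid_query("a#b"))          # False: "#" would start the fragment
```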
Additionally, the space character is encoded as + in the application/x-www-form-urlencoded format, while under RFC 3986 it is percent-encoded as %20.
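Python exposes both conventions, which makes the difference easy to see. In this short sketch, `quote` and `unquote` follow the %20 style, while `quote_plus`, `unquote_plus`, and (by default) `urlencode` follow the form-encoding style.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus, urlencode

print(quote("a b"))         # a%20b
print(quote_plus("a b"))    # a+b
print(unquote("a+b"))       # a+b  ("+" is left untouched)
print(unquote_plus("a+b"))  # a b

# urlencode() uses quote_plus() unless told otherwise via quote_via.
print(urlencode({"q": "a b"}))                   # q=a+b
print(urlencode({"q": "a b"}, quote_via=quote))  # q=a%20b
```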
To find out how various high-level languages differ in escaping these characters, a simple test was run. The results come first; the test code and its output are at the end.
Test Results
Only encoding and decoding in the query component were tested. The reference result for encoding is that every character in sub-delims is escaped and the special characters in unreserved (-, ., _, ~) are left unescaped. The table lists, for each implementation, the characters that deviate from this reference, with the encoding of the space character shown in its own column. The decoding test used a string in which every special character was escaped; because all implementations produced the same decoded result, it is not shown in the table.
Language | Module / Function | sub-delims Not Escaped | unreserved Escaped | SP Encoding | + Decoding |
---|---|---|---|---|---|
Python 3 | urllib.parse | | | + | Must use unquote_plus |
Go | net/url | | | + | |
Java | java.net.URLEncoder java.net.URLDecoder | * | ~ | + | |
JavaScript | URLSearchParams | * | ~ | + | |
JavaScript | encodeURIComponent decodeURIComponent | !'()* | | %20 | Cannot decode + |
Node.js | querystring | !'()* | | %20 | |
C# | System.Net.WebUtility | !()* | ~ | + | |
PHP | urlencode urldecode | | ~ | + | |
PHP | rawurlencode rawurldecode | | | %20 | Cannot decode + |
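The three encoding columns can be derived mechanically from an encoder's output for the test string. The sketch below is one possible way to do that; `TEST_INPUT`, `classify`, and the other names are hypothetical helpers written for this illustration, not part of the original test code.

```python
import re

TEST_INPUT = " !$&'()*+,;=-._~"
SUB_DELIMS = "!$&'()*+,;="
UNRESERVED_MARKS = "-._~"


def classify(encoded: str):
    """Compare an encoder's output for TEST_INPUT with the reference result.

    Returns (sub-delims left unescaped, unreserved marks escaped, space encoding).
    """
    # Split the output into percent-encoded octets and single literal characters.
    tokens = re.findall(r"%[0-9A-Fa-f]{2}|.", encoded)
    if len(tokens) != len(TEST_INPUT):
        raise ValueError("output does not line up with the test input")
    unescaped, escaped, space = "", "", ""
    for char, token in zip(TEST_INPUT, tokens):
        if char == " ":
            space = token
        elif char in SUB_DELIMS and token == char:
            unescaped += char
        elif char in UNRESERVED_MARKS and token != char:
            escaped += char
    return unescaped, escaped, space


# Java's output as listed in the test code section
print(classify("+%21%24%26%27%28%29*%2B%2C%3B%3D-._%7E"))  # ('*', '~', '+')
```

Feeding in the other outputs from the test code section gives the remaining rows, for example `("!'()*", '', '%20')` for Node.js querystring.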
Although the implementations differ in which symbols they escape when encoding, every tested implementation correctly decoded input in which all sub-delims and the special characters in unreserved were escaped.
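The exact string used for the decoding test is not listed, so the sketch below assumes it was the encoding test input with every character percent-encoded, and runs the check in Python only.

```python
from urllib.parse import unquote, unquote_plus

TEST_INPUT = " !$&'()*+,;=-._~"

# Percent-encode every character, including the unreserved ones
# (assumed form of the decoding test string).
fully_escaped = "".join(f"%{ord(char):02X}" for char in TEST_INPUT)
print(fully_escaped)
# %20%21%24%26%27%28%29%2A%2B%2C%3B%3D%2D%2E%5F%7E

# Both decoders restore the original characters.
print(unquote(fully_escaped) == TEST_INPUT)       # True
print(unquote_plus(fully_escaped) == TEST_INPUT)  # True
```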
Test Code
Python 3:
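from urllib.parse import urlencode, unquote, unquote_plus

# urlencode() escapes values with quote_plus(), so the space becomes "+"
print(urlencode({"param": " !$&'()*+,;=-._~"}))
# unquote() leaves "+" alone; unquote_plus() turns it into a space
print(unquote("param=a+b"))
print(unquote_plus("param=a+b"))

Output:
param=+%21%24%26%27%28%29%2A%2B%2C%3B%3D-._~
param=a+b
param=a b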
Go:
package main

import (
	"fmt"
	"net/url"
)

func main() {
	fmt.Println(url.QueryEscape(" !$&'()*+,;=-._~"))
	// QueryUnescape returns (string, error); Println prints both values
	fmt.Println(url.QueryUnescape("a+b"))
}

Output:
+%21%24%26%27%28%29%2A%2B%2C%3B%3D-._~
a b <nil>
Java:
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class Main {
    public static void main(String[] args) throws UnsupportedEncodingException {
        System.out.println(URLEncoder.encode(" !$&'()*+,;=-._~", StandardCharsets.UTF_8.toString()));
        System.out.println(URLDecoder.decode("a+b", StandardCharsets.UTF_8.toString()));
    }
}

Output:
+%21%24%26%27%28%29*%2B%2C%3B%3D-._%7E
a b
JavaScript:
// URLSearchParams (application/x-www-form-urlencoded)
const encode = new URLSearchParams();
encode.set("param", " !$&'()*+,;=-._~");
console.log(encode.toString());
const decode = new URLSearchParams("param=a+b");
console.log(decode.get("param"));

// encodeURIComponent / decodeURIComponent
console.log(encodeURIComponent(" !$&'()*+,;=-._~"));
console.log(decodeURIComponent("a+b"));

Output:
param=+%21%24%26%27%28%29*%2B%2C%3B%3D-._%7E
a b
%20!%24%26'()*%2B%2C%3B%3D-._~
a+b
Node.js:
const querystring = require("querystring");

console.log(querystring.stringify({ param: " !$&'()*+,;=-._~" }));
console.log(querystring.parse("param=a+b").param);

Output:
param=%20!%24%26'()*%2B%2C%3B%3D-._~
a b
C#:
using System;

class Program
{
    static void Main()
    {
        Console.WriteLine(System.Net.WebUtility.UrlEncode(" !$&'()*+,;=-._~"));
        Console.WriteLine(System.Net.WebUtility.UrlDecode("a+b"));
    }
}

Output:
+!%24%26%27()*%2B%2C%3B%3D-._%7E
a b
PHP:
<?php
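// urlencode / urldecode (form style: space becomes "+")
echo urlencode(" !$&'()*+,;=-._~") . "\n";
echo urldecode("a+b") . "\n";
// rawurlencode / rawurldecode (RFC 3986 style: space becomes "%20")
echo rawurlencode(" !$&'()*+,;=-._~") . "\n";
echo rawurldecode("a+b") . "\n";
?>

Output:
+%21%24%26%27%28%29%2A%2B%2C%3B%3D-._%7E
a b
%20%21%24%26%27%28%29%2A%2B%2C%3B%3D-._~
a+b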