Differences in URL Encoding Across Programming Languages

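While chatting in a group, a friend ran into a problem where the `*` character in a URL was not encoded the way they expected, so I tested how the URL encoding implementations of several programming languages differ.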

Relevant Standards

RFC 1738: Uniform Resource Locators (URL) is not an Internet Standard, so this article relies on the Internet Standard RFC 3986: Uniform Resource Identifier (URI): Generic Syntax. That standard recommends the general term "URI" and considers the terms "URL" and "URN" too restrictive (RFC 3305).

RFC 3986 defines the unreserved characters of a URI as follows:

unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"

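When a URI is encoded, unreserved characters should be left unescaped. The standard also says that when a URI containing percent-encoded unreserved characters is encountered, those escapes should be decoded back to the original characters: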

URIs that differ in the replacement of an unreserved character with
its corresponding percent-encoded US-ASCII octet are equivalent: they
identify the same resource.  However, URI comparison implementations
do not always perform normalization prior to comparison (see Section
6).  For consistency, percent-encoded octets in the ranges of ALPHA
(%41-%5A and %61-%7A), DIGIT (%30-%39), hyphen (%2D), period (%2E),
underscore (%5F), or tilde (%7E) should not be created by URI
producers and, when found in a URI, should be decoded to their
corresponding unreserved characters by URI normalizers.
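
As a rough illustration of the normalization rule quoted above, the following Python sketch decodes only the percent-encoded octets that correspond to unreserved characters and leaves every other escape in place. The function name `normalize_unreserved` is hypothetical. Uppercasing the remaining hex digits follows the case normalization described in RFC 3986 Section 6.2.2.1 and is not something tested in this article.

```python
import re
import string

# unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

_PCT_OCTET = re.compile(r"%([0-9A-Fa-f]{2})")


def normalize_unreserved(uri: str) -> str:
    """Decode percent-encoded unreserved characters; keep all other escapes."""

    def replace(match: re.Match) -> str:
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char
        # Not an unreserved character: keep the escape, with uppercase hex digits.
        return "%" + match.group(1).upper()

    return _PCT_OCTET.sub(replace, uri)


print(normalize_unreserved("/%7Euser/a%2fb%2Dc"))  # /~user/a%2Fb-c
```

The escaped slash stays escaped because `/` is a reserved character, so decoding it could change how the path is split into segments.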

The standard also points out that older URI implementations often escape the `~` character as `%7E`:

For example, the octet
corresponding to the tilde ("~") character is often encoded as "%7E"
by older URI processing implementations; the "%7E" can be replaced by
"~" without changing its interpretation.

The reserved characters, which may need escaping, fall into two categories:

reserved    = gen-delims / sub-delims
gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
            / "*" / "+" / "," / ";" / "="

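The gen-delims are tied to the structure of a URI, so they must be escaped when they appear as data. Whether a sub-delims character needs escaping depends on where it appears. In addition, because the % sign introduces an escape, a literal % must itself be escaped.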
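
To make the most conservative reading of these rules concrete, here is a minimal Python sketch that escapes every character except the unreserved ones, including `%` itself. The helper name `encode_component` is hypothetical. It assumes Python 3.7 or later, where `urllib.parse.quote` leaves `~` unescaped.

```python
from urllib.parse import quote


def encode_component(value: str) -> str:
    """Percent-encode everything except unreserved characters."""
    # safe="" drops "/" from quote's default safe set, so gen-delims,
    # sub-delims and "%" are all escaped.
    return quote(value, safe="")


print(encode_component("a/b?c=d&e"))  # a%2Fb%3Fc%3Dd%26e
print(encode_component("100%"))       # 100%25
print(encode_component("-._~"))       # -._~
```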

A typical URI is made up of the following components:

      foo://example.com:8042/over/there?name=ferret#nose
      \_/   \______________/\_________/ \_________/ \__/
       |           |            |            |        |
    scheme     authority       path        query   fragment
       |   _____________________|__
      / \ /                        \
      urn:example:animal:ferret:nose
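
For reference, the same two example URIs can be split into these components with Python's `urllib.parse.urlsplit`. This is only a sketch of one parser's view. `urlsplit` calls the authority `netloc`, and it is not a strict RFC 3986 validator.

```python
from urllib.parse import urlsplit

parts = urlsplit("foo://example.com:8042/over/there?name=ferret#nose")
print(parts.scheme)    # foo
print(parts.netloc)    # example.com:8042  (the authority)
print(parts.path)      # /over/there
print(parts.query)     # name=ferret
print(parts.fragment)  # nose

# A URN has no authority: everything after the scheme is the path.
urn = urlsplit("urn:example:animal:ferret:nose")
print(urn.scheme)      # urn
print(urn.path)        # example:animal:ferret:nose
```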

The grammar fragments that involve sub-delims are:

authority     = [ userinfo "@" ] host [ ":" port ]
userinfo      = *( unreserved / pct-encoded / sub-delims / ":" )

host          = IP-literal / IPv4address / reg-name
IP-literal    = "[" ( IPv6address / IPvFuture  ) "]"
IPvFuture     = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )
reg-name      = *( unreserved / pct-encoded / sub-delims )

path          = path-abempty    ; begins with "/" or is empty
              / path-absolute   ; begins with "/" but not "//"
              / path-noscheme   ; begins with a non-colon segment
              / path-rootless   ; begins with a segment
              / path-empty      ; zero characters
path-abempty  = *( "/" segment )
path-absolute = "/" [ segment-nz *( "/" segment ) ]
path-noscheme = segment-nz-nc *( "/" segment )
path-rootless = segment-nz *( "/" segment )
path-empty    = 0<pchar>
segment       = *pchar
segment-nz    = 1*pchar
segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
              ; non-zero-length segment without any colon ":"
pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"

query         = *( pchar / "/" / "?" )

fragment      = *( pchar / "/" / "?" )

According to this grammar, sub-delims characters may be left as they are in the authority, path, query, and fragment.
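
As a sketch of what "left as they are" could look like for a query value, the following Python helper keeps the characters that the `query` rule allows literally, except for `&`, `=`, and `+`. Excluding those three is my own assumption, based on their conventional meaning in `key=value&key=value` query strings; RFC 3986 does not require it. The names `QUERY_VALUE_SAFE` and `encode_query_value` are hypothetical.

```python
from urllib.parse import quote

# Literal characters allowed by query = *( pchar / "/" / "?" ),
# minus "&", "=" and "+", which are assumed here to act as delimiters
# or as an encoded space in key=value query strings.
QUERY_VALUE_SAFE = "!$'()*,;:@/?"


def encode_query_value(value: str) -> str:
    """Percent-encode a query value, keeping most sub-delims literal."""
    return quote(value, safe=QUERY_VALUE_SAFE)


print(encode_query_value("a*b&c=d e"))  # a*b%26c%3Dd%20e
print(encode_query_value("1+1"))        # 1%2B1
```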

Note also that a space is encoded as + in the application/x-www-form-urlencoded format, but as %20 under RFC 3986 percent-encoding.
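
Python's `urllib.parse` exposes both conventions side by side, which makes the difference easy to see. The snippet below is a small sketch using only standard-library calls.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus, urlencode

# RFC 3986 style: space becomes %20.
print(quote("a b+c", safe=""))                   # a%20b%2Bc
# Form style: space becomes "+", so a literal "+" must be escaped.
print(quote_plus("a b+c"))                       # a+b%2Bc

# urlencode uses the form style by default; quote_via switches it.
print(urlencode({"q": "a b"}))                   # q=a+b
print(urlencode({"q": "a b"}, quote_via=quote))  # q=a%20b

# Decoding: only the *_plus variant treats "+" as a space.
print(unquote("a+b"))                            # a+b
print(unquote_plus("a+b"))                       # a b
```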

To find out how different high-level languages escape these characters, I ran a simple test. The results come first; the test code and its output are listed at the end.

Test Results

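I only tested encoding and decoding of the query component. For every encoding test, the reference result is that all sub-delims characters are encoded and the unreserved special characters are not. The table lists the characters that deviate from this reference, with the encoding of the space character in a separate column. For the decoding test I used a string in which every special character was escaped; all implementations decoded it identically, so that result is not shown in the table.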

Language	Module / function	`sub-delims` left unescaped	`unreserved` escaped	Space encoding	`+` decoding
Python 3	`urllib.parse`	(none)	(none)	`+`	requires `unquote_plus`
Go	`net/url`	(none)	(none)	`+`	decoded as space
Java	`java.net.URLEncoder` `java.net.URLDecoder`	`*`	`~`	`+`	decoded as space
JavaScript	`URLSearchParams`	`*`	`~`	`+`	decoded as space
JavaScript	`encodeURIComponent` `decodeURIComponent`	`!'()*`	(none)	`%20`	cannot decode `+`
Node.js	`querystring`	`!'()*`	(none)	`%20`	decoded as space
C#	`System.Net.WebUtility`	`!()*`	`~`	`+`	decoded as space
PHP	`urlencode` `urldecode`	(none)	`~`	`+`	decoded as space
PHP	`rawurlencode` `rawurldecode`	(none)	(none)	`%20`	cannot decode `+`
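
The first three result columns can be derived mechanically from an encoder's output for the sample string ` !$&'()*+,;=-._~`. The Python sketch below shows one way to do that. The function name `classify` and the dictionary keys are hypothetical, and the input is assumed to be the encoded value only, with any `param=` prefix removed.

```python
SUB_DELIMS = "!$&'()*+,;="
UNRESERVED_SPECIAL = "-._~"


def classify(encoded: str) -> dict:
    """Summarize how an encoder treated the sample " !$&'()*+,;=-._~"."""
    # The sample starts with a space, so the output starts with its encoding.
    if encoded.startswith("%20"):
        space, rest = "%20", encoded[3:]
    elif encoded.startswith("+"):
        space, rest = "+", encoded[1:]
    else:
        raise ValueError("output does not start with an encoded space")

    upper = rest.upper()  # tolerate lowercase hex digits in escapes
    return {
        # sub-delims that still appear literally in the output
        "sub_delims_unescaped": "".join(c for c in SUB_DELIMS if c in rest),
        # unreserved characters that appear as %XX escapes
        "unreserved_escaped": "".join(
            c for c in UNRESERVED_SPECIAL if f"%{ord(c):02X}" in upper
        ),
        "space": space,
    }


# Java URLEncoder output from the test code section
print(classify("+%21%24%26%27%28%29*%2B%2C%3B%3D-._%7E"))
# encodeURIComponent output from the test code section
print(classify("%20!%24%26'()*%2B%2C%3B%3D-._~"))
```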

Although the implementations escape symbols differently, every program tested correctly decoded a string in which all sub-delims and unreserved special characters were escaped.
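
The decoding check can be reproduced in Python roughly as follows. This is a sketch, not the exact script I ran for the other languages. The helper `escape_all` is hypothetical and only handles ASCII input, which is enough for the sample string.

```python
from urllib.parse import unquote, unquote_plus

SAMPLE = " !$&'()*+,;=-._~"


def escape_all(text: str) -> str:
    """Percent-encode every character of an ASCII string."""
    return "".join(f"%{ord(ch):02X}" for ch in text)


fully_escaped = escape_all(SAMPLE)
print(fully_escaped)  # %20%21%24%26%27%28%29%2A%2B%2C%3B%3D%2D%2E%5F%7E

# Both decoders restore the original, because no literal "+" remains.
print(unquote(fully_escaped) == SAMPLE)       # True
print(unquote_plus(fully_escaped) == SAMPLE)  # True
```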

Test Code

Python 3:

from urllib.parse import urlencode, unquote, unquote_plus

print(urlencode({"param":" !$&'()*+,;=-._~"}))
print(unquote("param=a+b"))
print(unquote_plus("param=a+b"))

param=+%21%24%26%27%28%29%2A%2B%2C%3B%3D-._~
param=a+b
param=a b

Go:

package main

import (
    "fmt"
    "net/url"
)

func main() {
    fmt.Println(url.QueryEscape(" !$&'()*+,;=-._~"))
    fmt.Println(url.QueryUnescape("a+b"))
}

+%21%24%26%27%28%29%2A%2B%2C%3B%3D-._~
a b <nil>

Java:

import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class Main {
    public static void main(String[] args) throws UnsupportedEncodingException {
        System.out.println(URLEncoder.encode(" !$&'()*+,;=-._~", StandardCharsets.UTF_8.toString()));
        System.out.println(URLDecoder.decode("a+b", StandardCharsets.UTF_8.toString()));
    }
}

+%21%24%26%27%28%29*%2B%2C%3B%3D-._%7E
a b

JavaScript:

const encode = new URLSearchParams();
encode.set("param", " !$&'()*+,;=-._~");
console.log(encode.toString());
const decode = new URLSearchParams("param=a+b");
console.log(decode.get("param"));
console.log(encodeURIComponent(" !$&'()*+,;=-._~"));
console.log(decodeURIComponent("a+b"));

param=+%21%24%26%27%28%29*%2B%2C%3B%3D-._%7E
a b
%20!%24%26'()*%2B%2C%3B%3D-._~
a+b

Node.js:

const querystring = require("querystring");
console.log(querystring.stringify({ param: " !$&'()*+,;=-._~" }));
console.log(querystring.parse("param=a+b").param);

param=%20!%24%26'()*%2B%2C%3B%3D-._~
a b

C#:

using System;

class Program
{
    static void Main()
    {
        Console.WriteLine(System.Net.WebUtility.UrlEncode(" !$&'()*+,;=-._~"));
        Console.WriteLine(System.Net.WebUtility.UrlDecode("a+b"));
    }
}

+!%24%26%27()*%2B%2C%3B%3D-._%7E
a b

PHP:

<?php
echo urlencode(" !$&'()*+,;=-._~") . "\n";
echo urldecode("a+b") . "\n";
echo rawurlencode(" !$&'()*+,;=-._~") . "\n";
echo rawurldecode("a+b") . "\n";
?>

+%21%24%26%27%28%29%2A%2B%2C%3B%3D-._%7E
a b
%20%21%24%26%27%28%29%2A%2B%2C%3B%3D-._~
a+b