Dustin J. Mitchell

The Horrors of Partial-Identity Encodings -- or -- URL Encoding Is Hard

25 Jan 2021

URL encoding is a pretty simple thing, and has been around forever. Yet, it is associated with a significant fraction of bugs in web frameworks, libraries, and applications. Why is that? Is there a larger lesson here?

URL Encoding

If you’ve been lucky enough not to cross paths with HTTP in the last 30 years, then perhaps a refresher on URL encoding is in order. URL encodings are behind the prevalence of % in long URLs, such as https://example.com/download/test%2Flogs%2Fdebug.log.

The encoding is a way of embedding arbitrary ASCII strings in a URI. “Reserved” characters, !#$&'()*+,/:;=?@[], are replaced by %xx where xx is the hex representation of the ASCII code for the character. So / is replaced with %2F or %2f, : by %3A or %3a, and so on. The % character itself maps to %25.
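
As a quick sketch, JavaScript's built-in encodeURIComponent implements a close variant of this scheme (more on the differences below):

// encodeURIComponent percent-encodes most reserved characters, so a
// filename containing "/" becomes safe to embed in a URL path segment.
const segment = encodeURIComponent("test/logs/debug.log");
// segment === "test%2Flogs%2Fdebug.log"

const sale = encodeURIComponent("50% off");
// sale === "50%25%20off" -- "%" itself becomes "%25"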

There are some issues with this encoding.

Precisely when a character must be escaped depends on the context. For example, if used in the path portion of a URI, / has special meaning and must be escaped as in the example above, while = has no special meaning. But in the query portion of a URI, the opposite is true: = separates keys from values, while / has no special meaning.
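
Two hypothetical URLs, just to make the distinction concrete:

// In the path, "/" separates segments, so a literal "/" inside a filename
// must be escaped, while "=" may appear unescaped in a segment.
const pathUrl = "https://example.com/download/logs%2Fdebug.log/v=2";

// In the query, "=" separates keys from values, so a literal "=" in a value
// must be escaped, while "/" may appear unescaped.
const queryUrl = "https://example.com/download?file=logs/debug.log&expr=a%3Db";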

Partial-Identity

I totally made up this term, but here’s why it makes sense. An identity encoding (for example, in the Content-Encoding header) is an encoding that makes no changes – input equals output. URL encoding acts like an identity encoding for some inputs – marble magic encodes to marble magic. But for other inputs, it is not – omg! snakes! encodes to omg%21 snakes%21. Hence, “partial identity”.
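
One way to see the "partial" part is to ask which inputs a given encoder leaves untouched. A small sketch, again using encodeURIComponent as a stand-in for the encoder:

// Returns true when encoding is the identity for this particular input.
const encodesToItself = (s: string): boolean => encodeURIComponent(s) === s;

encodesToItself("marble-magic");   // true: output equals input
encodesToItself("omg! snakes!");   // false: the space is escaped to "%20"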

I assert that such encodings cause bugs and should be avoided in systems design.

Bugs

A bit of searching, or for many of us a moment’s recollection, will uncover a wealth of URL-encoding-related bugs.

The simplest class of bugs occurs when the HTTP client or server just ignores encoding entirely. This can occur when a client generates a URL naively, such as url = base_url + '/' + filename. This will work fine for simple filenames. Depending on the server implementation, it may even work for filenames containing /. But what about a filename containing a ? character?
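
A sketch of the naive construction and one possible fix (base_url and the filename are hypothetical):

const base_url = "https://example.com/download";

// Naive: works for "report.txt", silently breaks for other filenames.
let url = base_url + '/' + "how?.txt";
// "https://example.com/download/how?.txt" -- the "?" starts a query string,
// so everything after it is no longer part of the path.

// Better: encode the path segment before splicing it in.
url = base_url + '/' + encodeURIComponent("how?.txt");
// "https://example.com/download/how%3F.txt"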

The same class of bug might occur when a server fails to decode a value that the client (say, a browser) has encoded. This can result in displayed text such as "Claire%27s Cutlery".
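
The fix on the server side is symmetric: decode the segment before displaying or using it. A sketch only; a real framework usually does this for you:

// The browser requested ".../stores/Claire%27s%20Cutlery"; the raw segment
// must be decoded before it is displayed or used as a lookup key.
const rawSegment = "Claire%27s%20Cutlery";
const storeName = decodeURIComponent(rawSegment);
// storeName === "Claire's Cutlery"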

Already we see why these bugs occur so often: they are difficult to test for, or at least to remember to test for. Because the common cases are all in the identity portion of the encoding, basic tests like filename = "foo.txt" or store_name = "Widget Shoppe" succeed.

A second class of errors is due to ambiguity over which characters should be encoded. For example, RFC 3986 specifies that ' is a reserved character, but JS’s encodeURIComponent function does not encode it. In practice, this ambiguity rarely causes issues directly, because the most common delimiters in URLs – /, =, &, and ? – are just about always encoded (unless you accidentally use encodeURI). Decoders typically decode anything with a %, so an over-encoded value such as %66%6f%6f is decoded to foo, avoiding any concern of not decoding an unexpectedly-encoded character. However, this lenience has led to a number of security issues where filters meant to prevent malicious behavior could be bypassed by simply over-encoding a value.
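
A few concrete cases, again with the standard JS functions (the filter here is a deliberately naive illustration, not a recommendation):

encodeURIComponent("Claire's");      // "Claire's" -- the ' is left alone
encodeURI("logs/debug.log?x=1");     // "logs/debug.log?x=1" -- delimiters untouched
decodeURIComponent("%66%6f%6f");     // "foo" -- over-encoded input still decodes

// A naive filter that inspects the raw, still-encoded value is easy to bypass:
const looksSafe = (path: string) => !path.includes("../");
looksSafe("%2e%2e%2fetc/passwd");    // true, but it decodes to "../etc/passwd"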

A less common, but difficult-to-uncover variant comes from applying encoding or decoding too many times. I encountered this issue in the history package, which URL-decodes a value twice and then “fixes” that issue by later encoding the value. It seems like the result should be equivalent to a single decoding. The problem is, decoding is not a reversible operation – it loses information about which characters were encoded. In the example in the linked issue, the inputs are %2f and %252f. Decoding each of these twice gives / in both cases, losing the distinction between the two.
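
The collapse is easy to reproduce directly:

decodeURIComponent(decodeURIComponent("%252f"));  // "/" -- %252f -> %2f -> /
decodeURIComponent(decodeURIComponent("%2f"));    // "/" -- %2f -> /, then unchanged
// After two decodes the two inputs are indistinguishable, so re-encoding
// afterward cannot restore the original distinction.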

Unicode

Similar problems plague Unicode, and in fact the effect is common enough to have a name, Mojibake. The underlying cause is similar: most character encodings use the same codepoints for simple ASCII characters. This means that, for example, an HTML document encoded with UTF-8 will successfully decode to valid HTML as Latin-1, but non-ASCII characters in the content will be interpreted incorrectly.
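
A small sketch of the effect using the standard TextEncoder/TextDecoder APIs (the "latin1" label actually selects windows-1252, but the mix-up looks the same):

const utf8Bytes = new TextEncoder().encode("café");          // UTF-8 bytes
const misread = new TextDecoder("latin1").decode(utf8Bytes);
// misread === "cafÃ©" -- the ASCII characters survive, "é" turns into mojibake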

Lessons

In cases where we have to deal with partial-identity encodings, it is best to treat them precisely and explicitly. In type-safe languages such as Rust, wrapper types can help:

struct UrlEncodedString(String);

impl UrlEncodedString {
    fn encode(raw: &str) -> Self { /* ... */ }
    fn decode(&self) -> String { /* ... */ }
}

With this type in place, the compiler can detect any confusion between encoded and decoded strings.

In languages with looser type systems, such as C, JavaScript (but consider TypeScript!) or Python, careful attention in the form of comments and precise variable names can accomplish a similar effect. For example, encodedHash = location.hash helpfully reminds the reader that the hash is encoded.
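
TypeScript can go a step further than naming conventions with a “branded” string type. A minimal sketch; the brand name and helpers are arbitrary:

// A string known to be URL-encoded; the brand exists only at compile time.
type UrlEncoded = string & { readonly __brand: "UrlEncoded" };

const encode = (raw: string): UrlEncoded => encodeURIComponent(raw) as UrlEncoded;
const decode = (enc: UrlEncoded): string => decodeURIComponent(enc);

const hash: UrlEncoded = encode("a/b");   // ok
// const bad: UrlEncoded = "a%2Fb";       // compile error: a plain string is not branded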

And, of course, extensive testing using lots of different input strings will help discover mistakes. In most cases, libraries and frameworks should perform the encoding and decoding for you, although that only shifts the responsibility onto the authors of those libraries (I’m looking at you, history).
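
For instance, the standard WHATWG URL class (available in browsers and recent Node) handles query-string encoding and decoding itself; the URL below is hypothetical:

const u = new URL("https://example.com/download");
u.searchParams.set("file", "logs/debug.log&x=1");
u.toString();
// "https://example.com/download?file=logs%2Fdebug.log%26x%3D1"
u.searchParams.get("file");
// "logs/debug.log&x=1" -- decoded for you on the way back out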

By comparison, encodings like base64 or gzip do not suffer from this kind of problem. The reason is simple: they are not partial-identity encodings, so there is precisely one correct way to use them. Any other approach will fail in all non-trivial cases, showing failures under even the most rudimentary tests. The larger lesson, then, is that partial-identity encodings should be avoided wherever possible.