URL encoding is a pretty simple thing, and has been around forever. Yet, it is associated with a significant fraction of bugs in web frameworks, libraries, and applications. Why is that? Is there a larger lesson here?
URL Encoding
If you’ve been lucky enough not to cross paths with HTTP in the last 30 years, then perhaps a refresher on URL encoding is in order.
URL encodings are behind the prevalence of %
in long URLs, such as https://example.com/download/test%2Flogs%2Fdebug.log
.
The encoding is a way of embedding arbitrary ASCII strings in a URI.
“Reserved” characters, !#$&'()*+,/:;=?@[]
, are replaced by %xx
where xx
is the hex representation of the ASCII code for the character.
So /
is replaced with %2F
or %2f
, :
by %3A
or %3a
, and so on.
The %
character itself maps to %25
.
There are some issues with this encoding.
Precisely when a character must be escaped depends on the context.
For example, if used in the path portion of a URI, /
has special meaning and must be escaped as in the example above, while =
has no special meaning.
But in the query portion of a URI, the opposite is true: =
separates keys from values, while /
has no special meaning.
Partial-Identity
I totally made up this term, but here’s why it makes sense.
An identity encoding (for example, in the Content-Encoding header) is an encoding that makes no changes – input equals output.
URL encoding acts like an identity encoding for some inputs – marble magic
encodes to marble magic
.
But for other inputs, it is not – omg! snakes!
encodes to omg%21 snakes%21
.
Hence, “partial identity”.
I assert that such encodings cause bugs and should be avoided in systems design.
Bugs
A bit of searching, or for many of us a moment’s recollection, will uncover a wealth of URL-encoding-related bugs.
The simplest class of bugs occurs when two communicating systems don’t agree about when or what to encode or decode.
This can occur when a client generates a URL naively, such as url = base_url + '/' + filename
.