Now, while I am sure Arthur Miller was referring to writing a play when he said those words, they have an oddly appropriate resonance for my topic.
In the early nineties, when Lou Montulli applied the idea of magic cookies to HTTP to make the web stateful, I imagine he had no idea of the issues he would introduce for the future. Like much of the web's technology, it was a solution to an immediate problem which it has never subsequently been possible to improve upon.
The HTTP cookie is simply a way for a website to identify a connecting browser session so that state can be kept between retrieving pages. Due to shortcomings in the design of cookies and implementation details in browsers, this has led to a selection of unwanted side effects. The specific issue I am talking about here is the supercookie, where the super prefix carries much the same connotation as it does when applied to the word villain.
Whenever the browser requests a resource (web page, image, etc.) the server may return a cookie along with the resource, which your browser remembers. The cookie has a domain name associated with it, and when your browser requests additional resources, if the cookie's domain matches the requested resource's domain name, the cookie is sent along with the request.
As an example, the first time you visit a page on www.example.foo.invalid you might receive a cookie with the domain example.foo.invalid, so the next time you visit a page on www.example.foo.invalid your browser will send the cookie along. Indeed it will also send it along for any page on another.example.foo.invalid.
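To make that matching rule concrete, here is a minimal C sketch of the domain-suffix comparison. The function name cookie_domain_matches is invented for this illustration, and real cookie handling involves many more rules than shown here:

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Sketch only: does a cookie scoped to cookie_domain get sent to host?
 * This shows just the suffix comparison discussed above. */
static bool cookie_domain_matches(const char *host, const char *cookie_domain)
{
    size_t hlen = strlen(host);
    size_t dlen = strlen(cookie_domain);

    if (dlen > hlen) {
        return false;
    }
    if (strcmp(host + (hlen - dlen), cookie_domain) != 0) {
        return false; /* cookie domain is not a suffix of the host */
    }
    /* either an exact match or the suffix starts at a label boundary */
    return (hlen == dlen) || (host[hlen - dlen - 1] == '.');
}

int main(void)
{
    /* both hosts match a cookie scoped to example.foo.invalid */
    printf("%d\n", cookie_domain_matches("www.example.foo.invalid", "example.foo.invalid"));
    printf("%d\n", cookie_domain_matches("another.example.foo.invalid", "example.foo.invalid"));
    /* with no further restriction the same rule would also accept foo.invalid */
    printf("%d\n", cookie_domain_matches("www.example.foo.invalid", "foo.invalid"));
    return 0;
}

The last case is exactly the behaviour that makes supercookies possible, which is where the restrictions described next come in.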
A supercookie is simply one where, instead of being limited to one sub-domain (example.foo.invalid), the cookie is set for a top level domain (foo.invalid), so that when you visit any site under that domain (I used the invalid name in my examples but one could substitute com or co.uk) your web browser gives out the cookie. Hackers would love to be able to set up such cookies, as they could potentially control and hijack many sites at a time.
This problem was noted early on, and browsers were changed to disallow cookie domains with fewer than two parts, so example.invalid or example.com were allowed but invalid or com on their own were not. This works fine for top level domains like .com, .org and .mil, but not for countries whose registrars have rules about second levels, like the uk domain (uk domains must have a second level such as .co.uk, so co.uk itself behaves like a top level domain yet passes the two-part check).
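Continuing the sketch from above, the naive guard amounts to counting labels in the proposed cookie domain and refusing anything with fewer than two; the helper below is hypothetical and ignores details such as a leading dot:

#include <stdbool.h>
#include <stddef.h>

/* Sketch of the early browsers' rule: a cookie domain needs at least
 * two dot-separated labels to be accepted. */
static bool naive_domain_allowed(const char *cookie_domain)
{
    size_t labels = 1;

    for (const char *p = cookie_domain; *p != '\0'; p++) {
        if (*p == '.') {
            labels++;
        }
    }
    return labels >= 2;
}

/* naive_domain_allowed("com")   -> false, correctly refused
 * naive_domain_allowed("co.uk") -> true, wrongly accepted: a supercookie */

As the final comment shows, co.uk sails straight through this check.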
There is no way to generate the correct set of these effective top level domains with an algorithm, so a database is required; it is called the Public Suffix List (PSL). This database is a simple text formatted list with wildcard and inversion syntax and is, at the time of writing, around 180Kb of text including comments, which compresses down to 60Kb or so with deflate.
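For a flavour of the format, a tiny illustrative excerpt (abbreviated rather than quoted verbatim; the ck entries are the classic wildcard example) might read:

// comments are introduced with a double slash
com
// uk registrations sit under mandated second levels
co.uk
org.uk
// a wildcard rule covers every label at that level...
*.ck
// ...and an exception rule (the inversion syntax) punches a hole in it
!www.ck

Each non-comment line names a suffix under which the public may directly register names, which is exactly the boundary cookies must not cross.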
A few years ago, when ICANN allowed the great expansion of top level domains, the existing NetSurf supercookie handling was found to be wanting and I decided to implement a solution using the PSL. At that point the database was only 100Kb of source, or 40Kb compressed.
I started by looking at the limited number of existing libraries. In fact only the regdom library was adequate, but it used 150Kb of heap to load the pre-processed list. This would have had the drawback of increasing NetSurf's heap usage significantly (we still have users on 8Mb systems). Because of this, and the need to run a PHP script to generate the pre-processed input, it was decided the library was not suitable.
Lacking other choices, I came up with my own implementation, which used a perl script to construct a tree of domains from the PSL as a static array, with the label strings in a separate table. At the time my implementation added 70Kb of read only data, which I thought reasonable, and allowed for direct lookup of answers from the database.
This solution still required a pre-processing step to generate the C source code, but perl is much more readily available, is a language already used by our tooling, and we could always simply ship the generated file. As long as the generated file was updated at release time, as we already do for our fallback SSL certificate root set, this would be acceptable.
I put the solution into NetSurf, was pleased no-one seemed to notice, and moved on to other issues. Recently, while fixing a completely unrelated issue in the display of session cookies in the management interface, I realised I had some test supercookies present in the display. After the initial "that's odd" I realised with horror there might be a deeper issue.
It quickly became evident the PSL generation was broken and had been for a long time. Even worse, somewhere along the line the "redundant" empty generated source file had been removed and the ancient fallback code path was all that had been used.
This issue had escalated somewhat from a trivial display problem. I took a moment to assess the situation a bit more broadly and came to the conclusion that there were a number of interconnected causes, centred around the lack of automated testing, which could be solved by extracting the PSL handling into a "support" library.
NetSurf has several of these support libraries, which can be used separately from the main browser project but are principally oriented towards it. These libraries are shipped and built in releases alongside the main browser codebase and mainly serve to make APIs more obvious and modular. In this case my main aim was to have the functionality segregated into a separate module which could be tested, updated and monitored directly by our CI system, meaning the embarrassing failure I had found could never occur again.
Before creating my own library I did consider libpsl, a library which had been created since I wrote my original implementation. Initially I was very interested in using it, given that it managed a data representation within a mere 32Kb.
Unfortunately the library integrates a great deal of IDN and punycode handling which was not required in this use case. NetSurf already has to handle IDN and punycode translations and uses punycode encoded domain names internally, only translating to unicode representations for display, so duplicating this functionality through additional libraries would require a great deal of resource beyond the raw data representation.
I put the library together based on the existing Perl code generator and integrated the test set that comes along with the PSL. I was a little alarmed to discover that the PSL had almost doubled in size since the implementation was originally written, and the library's trivial test program now weighed in at a hefty 120Kb.
This stemmed from two main causes:
- there were now many more domain label strings to be stored
- there were now many, many more nodes in the tree.
To address the first cause, the length of each domain label string was moved into the unused padding space within its tree node, removing a byte from each domain label and saving 6Kb. Next it occurred to me that, while building the domain label string table, if the label to be added already existed as a substring within the table it could be elided. The domain labels were therefore sorted from longest to shortest and added in that order, searching for substring matches as the table was built; this saved another 6Kb. I am sure there are ways to reduce this further that I have missed (if you see them let me know!) but a 25% saving (47Kb to 35Kb) was a good start.
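As an illustration of the substring elision, here is a hedged C sketch (this is not the project's actual perl generator; the names and table size are invented for the example):

#include <stdio.h>
#include <string.h>

/* Sketch of string-table construction with substring elision. Labels are
 * assumed to arrive sorted longest first so shorter labels can be found
 * inside longer ones already stored. */
static char table[64 * 1024]; /* placeholder capacity */
static size_t table_len;

/* returns the offset of label within the table, appending it if needed */
static size_t add_label(const char *label)
{
    size_t lablen = strlen(label);

    /* search for the label as a substring of what is already stored */
    if (table_len >= lablen) {
        for (size_t off = 0; off <= table_len - lablen; off++) {
            if (memcmp(table + off, label, lablen) == 0) {
                return off; /* reuse the existing bytes */
            }
        }
    }

    /* not found: append (no terminator needed, lengths live in the nodes) */
    memcpy(table + table_len, label, lablen);
    table_len += lablen;
    return table_len - lablen;
}

int main(void)
{
    /* hypothetical labels, pre-sorted longest to shortest */
    const char *labels[] = { "example", "ample", "com", "co" };

    for (size_t i = 0; i < sizeof(labels) / sizeof(labels[0]); i++) {
        printf("%s -> offset %zu\n", labels[i], add_label(labels[i]));
    }
    printf("table length: %zu\n", table_len);
    return 0;
}

Here "ample" and "co" cost nothing because they are already present inside "example" and "com".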
The second cause was a little harder to address. The structure I started with for representing nodes in the tree looked reasonable at first glance.
struct pnode {
    uint16_t label_index; /* index into string table of label */
    uint16_t label_length; /* length of label */
    uint16_t child_node_index; /* index of first child node */
    uint16_t child_node_count; /* number of child nodes */
};
I examined the generated table and observed that the majority of nodes were leaf nodes (they had no children), which makes sense given the type of data being represented. By allowing two types of node, one for labels and a second for the child node information, the node size would be halved in most cases while requiring only a modest change to the tree traversal code. The only issue was finding a way to indicate that a node has child information. I realised that domain labels have a maximum length of 63 characters, meaning their length can be represented in six bits, so a uint16_t was excessive. That space was split into two uint8_t parts, one for the length and one for a flag indicating that a child data entry follows.
union pnode {
    struct {
        uint16_t index; /* index into string table of label */
        uint8_t length; /* length of label */
        uint8_t has_children; /* the next table entry is a child node */
    } label;
    struct {
        uint16_t node_index; /* index of first child node */
        uint16_t node_count; /* number of child nodes */
    } child;
};

static const union pnode pnodes[8580] = {
    /* root entry */
    { .label = { 0, 0, 1 } }, { .child = { 2, 1553 } },
    /* entries 2 to 1794 */
    { .label = { 37, 2, 1 } }, { .child = { 1795, 6 } },
    ...
    /* entries 8577 to 8578 */
    { .label = { 31820, 6, 1 } }, { .child = { 8579, 1 } },
    /* entry 8579 */
    { .label = { 0, 1, 0 } },
};
This change reduced the node array size from 63Kb to 33Kb, almost a 50% saving. I considered using bitfields to pack the label length and has_children flag into a single byte, but such packing would not reduce the length of a node below 32 bits because it is unioned with the child structure.
The possibility of using the spare uint8_t freed by bitfield packing to squeeze an additional label node into the space of three others was considered, but it added a great deal of complexity to node lookup and table construction for a saving of around 4Kb, so was not incorporated.
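To show how the two-entry layout is walked, here is a hedged sketch of locating one child by label. It assumes it sits in the same generated source file as the union pnode definition and pnodes array above, plus a label_table[] array holding the concatenated label strings; find_child and label_table are names invented for the sketch, and this is not the library's actual code:

#include <stdint.h>
#include <string.h>

/* label_table is assumed to hold the concatenated label strings that
 * label.index points into (generated alongside the pnodes array). */
static int find_child(uint32_t child_index, uint32_t child_count,
                      const char *label, size_t label_len)
{
    uint32_t idx = child_index;

    for (uint32_t n = 0; n < child_count; n++) {
        const union pnode *node = &pnodes[idx];

        if ((node->label.length == label_len) &&
            (memcmp(label_table + node->label.index, label, label_len) == 0)) {
            return (int)idx; /* matching child found */
        }
        /* step over this label entry and, if present, its child entry */
        idx += node->label.has_children ? 2 : 1;
    }
    return -1; /* no such child at this level */
}

A full lookup would split the host name into labels, start from the root entry (whose child entry gives the first level of the tree), and descend one level per label while honouring the wildcard and exception entries.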
With the changes incorporated the test program was a much more acceptable 75Kb, reasonably close to the size of the compressed source but with the benefit of direct lookup. Integrating the library's single API call into NetSurf was straightforward and resulted in correct operation when tested.
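For completeness, a usage sketch. I believe the call is nspsl_getpublicsuffix() declared in nspsl.h, but treat the name, header and return convention here as assumptions and check the library's header for the authoritative declaration:

#include <stdio.h>
#include <nspsl.h> /* header name assumed */

int main(void)
{
    /* assumed behaviour: returns the public suffix portion of the host
     * name, or NULL if none can be determined */
    const char *suffix = nspsl_getpublicsuffix("www.example.co.uk");

    printf("public suffix: %s\n", (suffix != NULL) ? suffix : "(none)");
    return 0;
}

With that one call a cookie whose domain is itself a public suffix can be rejected before it is ever stored.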
This episode reminded me of the dangers of code that can fail silently. It exposed our users to a security problem we thought had been addressed almost six years ago and squandered the limited resources of the project. Hopefully it is a lesson we will not have to learn again any time soon. If there is a positive to take away, it is that the new implementation is more space efficient, automatically built and, importantly, tested.