Splay trees, compression, encryption, and embedding

Quite often, when designing things like file formats, one comes across the problem of embedding. For example, look at HTML; when one wants to say:

I like this one

One has to write, in HTML:

I like <em> this </em> one

The angle brackets are used to mark the "em" and "/em" as special control sequences, not part of the "actual data". But then what if you want to write:

In maths today I learnt that 2 < 5

Well, you need to again use specially marked control sequences to request an "actual" < symbol:

In maths today I learnt that 2 &lt; 5

The general principle is that you have to choose one or more symbols and make them "special", as they mark control codes amongst the data. If those special symbols then occur in the data, you have to do something more complex to prevent them from being mistaken for control codes.
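
To make the problem concrete, here's a tiny sketch of the escaping dance in Python - just the familiar HTML convention of rewriting & and <, nothing specific to what follows:

    # '<' is reserved as a control character, so any '<' in the actual data
    # must be rewritten as an escape sequence, and the decoder must undo it.

    def escape(text):
        # '&' must be escaped too, or a literal "&lt;" in the data would be
        # indistinguishable from an escape sequence.
        return text.replace("&", "&amp;").replace("<", "&lt;")

    def unescape(text):
        return text.replace("&lt;", "<").replace("&amp;", "&")

    print(escape("In maths today I learnt that 2 < 5"))
    # In maths today I learnt that 2 &lt; 5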

This is a problem because most file formats work in terms of 8-bit bytes. The "actual data" comes as 8-bit bytes, and the file format itself must combine that, along with its own control sequences, into a string of 8-bit bytes. There's no way for the file format to generally "reserve" some of those byte symbols for its exclusive use, unless it can guarantee that they will never occur in actual data.

But recall the description of entropy coding above: an entropy coder takes an input, which is a sequence of symbols, and outputs a string of bits. I mentioned that most entropy coders start with a file - a string of 8-bit bytes - and so use the 256 possible values of an 8-bit byte as their symbols, often with a 257th symbol to mark the end of the file.

However, if our file format always entropy coded its output, it could use 256 symbols for the 256 possible 8-bit bytes of the actual data, then use symbol 257 to mark (in the case of HTML) the start of emphasis, 258 to mark the end of emphasis, 259 to mark a paragraph break, and so on... And it would compress slightly more efficiently, since the control codes would be assigned output sequences with lengths corresponding to their frequencies in the input; whereas in real HTML, a sequence like "<em>" has to encode the e and the m according to their frequencies in the document as a whole, whether they appear in text or in HTML tags.
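
For a rough feel of how that would work, here's a short Python sketch of a static Huffman coder over an extended alphabet - the 256 byte values plus a few extra control symbols. The symbol numbers and names are my own invention, purely for illustration:

    import heapq
    from collections import Counter

    # Hypothetical extended alphabet: 0-255 are literal bytes, and a few extra
    # symbol numbers stand for control codes rather than data.
    EM_START, EM_END, END_OF_FILE = 256, 257, 258

    def huffman_code(freqs):
        # Build a prefix code {symbol: bit string} from a frequency table;
        # assumes at least two distinct symbols.
        heap = [(n, i, [sym]) for i, (sym, n) in enumerate(freqs.items())]
        heapq.heapify(heap)
        codes = {sym: "" for sym in freqs}
        tiebreak = len(heap)
        while len(heap) > 1:
            n1, _, group1 = heapq.heappop(heap)
            n2, _, group2 = heapq.heappop(heap)
            for s in group1:
                codes[s] = "0" + codes[s]
            for s in group2:
                codes[s] = "1" + codes[s]
            heapq.heappush(heap, (n1 + n2, tiebreak, group1 + group2))
            tiebreak += 1
        return codes

    # "I like <em>this</em> one" as a single symbol stream: the control codes
    # are first-class symbols, so no escaping is ever needed.
    symbols = [*b"I like ", EM_START, *b"this", EM_END, *b" one", END_OF_FILE]
    codes = huffman_code(Counter(symbols))
    bits = "".join(codes[s] for s in symbols)
    print(f"{len(symbols)} symbols -> {len(bits)} bits")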

Back when I was about 16 I considered entropy coding as a canonical interchange format for strings, for this reason and for the "UTF-8 dilemma"; Unicode text is defined as a sequence of 32-bit characters, covering Roman, Arabic, Russian, Japanese, Chinese, and generally any language on Earth. However, for English text, which generally only uses under a hundred characters (once you cover upper and lower case, numbers, and punctuation), using 32 bits for each character is a bit of a waste. In fact, for just about any language, 32 bits per character is a waste; even with Japanese and Chinese ideograms, any given message will rarely use more than a few thousand different characters.

There are encodings of Unicode such as UTF-8, which uses short codes for Western European roman characters and longer codes for Japanese and so on, while still being able to represent all of Unicode. UTF-8 is, thus, very popular, since lots of software is written in Western Europe and America, but it's inefficient for ideographic language users. So, I considered, one could have an entropy encoder that accepts 32-bit input symbols, with a few implementation details to make it only require memory for the symbols it actually saw rather than all four billion possible symbols, and use that to encode text for communication.
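
The "only require memory for the symbols it actually saw" part is less exotic than it sounds; here's a sketch in Python, where the model is just a sparse frequency table keyed by code point (my own simplification, not a complete coder):

    from collections import Counter

    # Treat each Unicode code point as one input symbol, and keep frequencies
    # in a dict rather than an array of four billion counters.
    def sparse_model(text):
        freqs = Counter(ord(ch) for ch in text)   # one entry per code point seen
        total = sum(freqs.values())
        # An entropy coder would then give each symbol roughly -log2(p) bits.
        return {cp: n / total for cp, n in freqs.items()}

    model = sparse_model("こんにちは, hello!")
    print(len(model), "distinct symbols seen")    # far fewer than 2**32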

Recently, while thinking about transfer encodings for IRON, I have revisited this idea. Rather than sitting worrying about how to represent variable-length integers in as few bits as possible, and all the usual concerns in designing something like XDR, BER, or PER, I'm seriously considering just encoding the input abstract values - numbers, symbols, characters, bit strings, and so on - into a stream of symbols drawn from a broad alphabet, probably 17 or so bits in size. Rather than give the entire Unicode alphabet a symbol per character, which would require 21-bit symbols, I would express Unicode character strings by splitting each character into its lower 16 bits and its upper word: 17 symbols would be assigned to "set the current upper word to X", and 65,536 symbols to "the Unicode character obtained by combining the current upper word with X as the lower word", with the upper word defaulting to 0 at the start of a file. These would be joined by symbols for the integers from -127 to 255 inclusive (thus allowing for signed or unsigned two's-complement bytes, while making it clear which kind you mean), then a range of symbols for "the next X numbers are actually to be joined together into one big number, big endian" for numbers outside of this range, up to some X. Plus symbols for True and False, and then symbols for "here begins an array" and other such compound-type boundaries.
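
As a sketch of how such an encoding might look in practice - with completely made-up symbol numbers, and the "join the next X numbers into one big number" symbols left out for brevity - here is one way a handful of abstract values could be turned into symbols before entropy coding:

    # Illustrative symbol assignment only; the real format could number these
    # however it liked.
    CHAR_BASE = 0              # 0..65535: character with this lower 16-bit word
    SET_UPPER = 65536          # 65536..65552: set the current upper word to 0..16
    INT_BASE = 65553           # 65553..65935: the integers -127..255
    TRUE, FALSE = 65936, 65937
    ARRAY_BEGIN, ARRAY_END = 65938, 65939

    def encode(value, out, state):
        # Append the symbols for one abstract value to `out`; `state` carries
        # the decoder-visible current upper word, which starts at 0.
        if value is True or value is False:
            out.append(TRUE if value else FALSE)
        elif isinstance(value, int) and -127 <= value <= 255:
            out.append(INT_BASE + (value + 127))
        elif isinstance(value, str):
            for ch in value:
                cp = ord(ch)
                upper, lower = cp >> 16, cp & 0xFFFF
                if upper != state["upper"]:        # only emitted on plane changes
                    out.append(SET_UPPER + upper)
                    state["upper"] = upper
                out.append(CHAR_BASE + lower)
        elif isinstance(value, list):
            out.append(ARRAY_BEGIN)
            for item in value:
                encode(item, out, state)
            out.append(ARRAY_END)
        else:
            raise TypeError("sketch only handles bools, small ints, strings, lists")

    symbols, state = [], {"upper": 0}
    encode([True, 42, "hi \U0001F600"], symbols, state)   # U+1F600 forces an upper-word change
    print(symbols)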

The result? A file format that should strike a great balance between self-description (since numbers and strings and so on are encoded with totally different symbols, it's possible to decode the abstract values without access to any external information), compactness, and simplicity of encoding. That last concern isn't just me being lazy as a potential implementer of this thing - complex, efficient encodings are notoriously difficult to extend efficiently and backwards-compatibly in future; yet in the abstract-symbol file format, one can just add extra symbols for new types of data, which older decoders will be able to identify and reject rather than mistaking them for something else, and the entropy encoder will remain just as efficient, since it doesn't really care what its input symbol set is.

Of course, one can also add in dictionary compression before the entropy compression, as per DEFLATE, for extra space efficiency at the cost of more CPU time and the need to store a buffer of the past and future symbols to look for patterns. And for when CPU time is the important thing, rather than space, one can just skip the compression entirely and transmit a sequence of 18-bit symbols, probably rearranged a bit to make them easier to process on byte-based computers... which is still hardly as inefficient as XML.
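
That uncompressed fallback is little more than bit-packing; a minimal sketch, assuming fixed 18-bit symbols packed most-significant-bit first:

    def pack_symbols(symbols, bits_per_symbol=18):
        # Pack fixed-width symbols into bytes, padding the final partial byte
        # with zero bits.
        acc, nbits, out = 0, 0, bytearray()
        for s in symbols:
            acc = (acc << bits_per_symbol) | s
            nbits += bits_per_symbol
            while nbits >= 8:
                nbits -= 8
                out.append((acc >> nbits) & 0xFF)
        if nbits:
            out.append((acc << (8 - nbits)) & 0xFF)
        return bytes(out)

    print(pack_symbols([65938, 300, 65939]).hex())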

Creative Commons Attribution-NonCommercial-ShareAlike 2.0 UK: England & Wales