LZW Mystery Cache

Hidden: 2/8/2006
Difficulty: 3 out of 5
Terrain: 2 out of 5

Size: regular


Geocache Description:

The cache is not at the posted coordinates, although it is within two miles.


Data compression is the art of converting data from an original representation to a more compact, compressed representation. Although with pictures or audio it might be acceptable if some information were lost during compression—keeping only an approximation to the original data—for many applications approximation is not acceptable. A data compression technique is lossless if there exists a corresponding decompression technique that always recovers the original data without error. As it turns out, it is impossible for any lossless data compression technique to always succeed in reducing the size of the original data. Consider all cases of original data consisting of n bits. There are 2^n such cases, each of which must compress to a distinct version of compressed data, or else for some version it would be impossible to know which original case should be recovered. But if you count up all possible versions that contain fewer than n bits, you only get 2^n - 1 possibilities, which is not enough. In fact, it is typical for a lossless data “compression” technique to significantly increase the size of the data in many troublesome cases.
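This pigeonhole argument is easy to check mechanically. The following Python sketch (the function name is mine, not from any library) counts the bit strings of every length below n and confirms the total always falls one short of 2^n:

```python
# Count all distinct bit strings shorter than n bits: there are 2^k strings
# of length k, so lengths 0 through n-1 give 2^0 + ... + 2^(n-1) = 2^n - 1
# possible compressed outputs -- one fewer than the 2^n possible n-bit inputs.
def shorter_strings(n):
    return sum(2 ** k for k in range(n))

for n in range(1, 16):
    assert shorter_strings(n) == 2 ** n - 1  # always one short of 2^n
```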

Nonetheless, in practice there are lossless data compression techniques that substantially reduce in size many useful kinds of original data. For example, it takes more than five million bytes to represent the complete works of Shakespeare as ASCII text. The text contains about one million words, of which about thirty thousand are distinct. As you might expect, common English words such as the, and, you, that, not, and with appear a lot more frequently than most of the other words. The popular lossless data compression program gzip exploits such repetition to compress the five million bytes of ASCII Shakespeare down to just over two million bytes.

Exploiting repeated words is the idea behind dictionary-based lossless data compression. This compression technique employs a dictionary of words likely to appear in the original data, assigns a reference number to each word, and then produces compressed data in which the original words are largely or entirely replaced by reference numbers. (Of course, instead of words, any substrings could be used.) One problem is that a dictionary suitable for English might not work so well with, say, Esperanto. A related problem is that some words, for example Exeunt, appear a lot more frequently in Shakespeare than they might in, say, a medical textbook. The solution is to create a specialized dictionary for the original data and then include the dictionary in the compressed data so that the decompressor can recover the original words. Of course, this solution has its own drawbacks, namely, the preparatory work required to create the specialized dictionary and the space required to include it in the compressed data.

In their groundbreaking publications in 1977 and 1978, Jacob Ziv and Abraham Lempel developed techniques for creating the dictionary on-the-fly as the compressor makes one pass reading through the original data and writing out the compressed data. Things are cleverly arranged so that the decompressor can also make one pass reading through the compressed data and writing out the original data, recreating at each step the identical dictionary that the compressor had at that point. The actions of the compressor and the decompressor are effectively locked together so that the decompressor always knows the meaning of each reference number it encounters, even though the dictionary never explicitly appears in the compressed data. In 1984, Terry Welch added substantial practical improvements to the 1978 technique of Ziv and Lempel. The resulting dictionary-based lossless data compression technique is known as LZW, after the initials of its inventors.

  • Jacob Ziv and Abraham Lempel, “A Universal Algorithm for Sequential Data Compression,” IEEE Transactions on Information Theory, 23(3), pp. 337-343, 1977.
  • Jacob Ziv and Abraham Lempel, “Compression of Individual Sequences Via Variable-Rate Coding,” IEEE Transactions on Information Theory, 24(5), pp. 530-536, 1978.
  • Terry Welch, “A Technique for High-Performance Data Compression,” IEEE Computer, 17(6), pp. 8-19, 1984.

LZW is clever, effective, and not really all that complicated. LZW became quite popular, notably being used in the Unix compress utility and in the definition of the hugely popular GIF image format. (GIF is a service mark of CompuServe, Inc.) Perhaps LZW became too popular, because people failed to notice that it was covered by U.S. Patent 4,558,302. Eventually the owner of the patent realized that LZW was being used in numerous places and began to demand license fees. One description of the resulting brouhaha can be found at http://purl.org/mcb/id/giflzw. The LZW patent expired in 2003 and its counterpart patents in Canada, Europe, and Japan expired in 2004.

In the following exposition, we describe the basic form of the LZW compression technique without going into various additional practical improvements developed by Welch. The LZW compressor maintains a dictionary DICT of unique strings. In contrast to an ordinary dictionary, which lists entries in lexicographic order, DICT lists entries in the order they were added to the dictionary. For each string S in the dictionary, the reference number of S, which we write as DICT[S], is defined as the number of strings that were already in the dictionary just before S was added. Observe that reference numbers start at zero and count upward. Initially, DICT is preloaded with all single-character strings. Additional strings are added as compression proceeds.
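For concreteness, the preloading step might look like the following Python sketch (the names ALPHABET, preload, and DICT are my own; the 27-character alphabet matches the worked example later on):

```python
# Preload the dictionary with all single-character strings. DICT[S] is the
# number of strings already present when S was added, so reference numbers
# start at 0 and count upward in order of insertion.
ALPHABET = "*ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def preload(alphabet=ALPHABET):
    return {ch: i for i, ch in enumerate(alphabet)}

DICT = preload()   # '*' -> 0, 'A' -> 1, ..., 'Z' -> 26
```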

The LZW compressor operates by determining the longest string W that (a) starts at the current point in the original data and (b) is currently in the dictionary. This string is then represented by its reference number in the compressed output. Let C be the character that follows W in the original data. We call C the “stop” character, because it stopped the compressor from matching a longer string. From the definition of W, we know that the string defined by appending C onto the end of W, which we write as W+C, is not currently contained in the dictionary. So the compressor adds W+C to the dictionary. The compressor then moves its consideration of the original data to the location of the stop character and repeats the process of determining the longest string, now starting with C. Because the dictionary is preloaded with all single-character strings, anything that appears in the original data will match some string in the dictionary, so in the worst case the compressor outputs a reference number for each character. But as the dictionary builds up, repeated occurrences of longer and longer strings result in just one reference number for each repetition, thus compressing the data. A precise description of the actions of the LZW compressor is as follows:

Step 1. Set W to the empty string and preload DICT.
Step 2. If there are no more original data characters, goto Step 8.
Step 3. Read the next original data character and call it C.
Step 4. If W+C is contained in DICT, set W to W+C and goto Step 2.
Step 5. Output the reference number DICT[W].
Step 6. Add W+C to DICT.
Step 7. Set W to the single-character string C and goto Step 2.
Step 8. If W is not the empty string, output the reference number DICT[W].
Step 9. Stop.
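The nine steps above translate almost line-for-line into code. Here is a minimal Python sketch (the function name and the use of a Python dict for DICT are my choices, and the dictionary is preloaded with the 27-character alphabet used in the example that follows):

```python
def lzw_compress(data, alphabet="*ABCDEFGHIJKLMNOPQRSTUVWXYZ"):
    # Step 1: set W to the empty string and preload DICT with all
    # single-character strings, numbered 0, 1, 2, ... in order.
    dictionary = {ch: i for i, ch in enumerate(alphabet)}
    w = ""
    output = []
    for c in data:                           # Steps 2-3: read next character C
        if w + c in dictionary:              # Step 4: extend the current match
            w = w + c
        else:
            output.append(dictionary[w])     # Step 5: output DICT[W]
            dictionary[w + c] = len(dictionary)  # Step 6: add W+C to DICT
            w = c                            # Step 7: restart the match at C
    if w:                                    # Step 8: flush the final match
        output.append(dictionary[w])
    return output                            # Step 9: stop
```

Running `lzw_compress("THE*CACHE*IS")` reproduces the worked example below, yielding [20, 8, 5, 0, 3, 1, 3, 28, 0, 9, 19].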

Let us consider an example. For simplicity, we use a 27-character alphabet consisting of a spacing character * and the upper case letters A through Z. It is convenient to display the dictionary as a parenthesized list of strings, the original data as a parenthesized list of characters, and the compressed data as a parenthesized list of reference numbers. To reduce clutter, we separate the items in a list with space instead of commas.

We preload our dictionary with ( * A B C D E F G H I J K L M N O P Q R S T U V W X Y Z ). Observe that the spacing character * is assigned reference number 0 and the upper case letters A through Z are assigned reference numbers 1 through 26, respectively. Let us see how compression proceeds supposing that the original data consists of the twelve characters ( T H E * C A C H E * I S ).

The LZW compressor reads the input characters in order, matching the longest string already contained in the dictionary. Starting off, the longest string the compressor can match is the single-character string T, which has reference number 20. So the compressor matches this string and outputs the corresponding reference number. The “stop” character is H, so the compressor adds the two-character string TH to the dictionary, giving it reference number 27. At this point, the output is ( 20 ), the unmatched input is ( H E * C A C H E * I S ), and the dictionary is ( * A B C D E F G H I J K L M N O P Q R S T U V W X Y Z TH ). Note that one string has been matched, one reference number has been output, and one string has been added to the dictionary.

The next six input characters are each matched in exactly the same way. The compressor matches a single-character string, outputs its reference number, and adds a two-character string to the dictionary. After these additional six matches, the output is ( 20 8 5 0 3 1 3 ), the unmatched input is ( H E * I S ), and the dictionary is ( * A B C D E F G H I J K L M N O P Q R S T U V W X Y Z TH HE E* *C CA AC CH ). Six additional strings have been added to the dictionary and given the reference numbers 28 through 33.

Now something interesting happens. Observe that the next two characters of the original data form the string HE, which is already in the dictionary with reference number 28. The compressor cannot match any further, because * stops the match. So the compressor matches HE, outputs the corresponding reference number, and adds the three-character string HE* to the dictionary. At this point the output is ( 20 8 5 0 3 1 3 28 ), the unmatched input is ( * I S ), and the dictionary is ( * A B C D E F G H I J K L M N O P Q R S T U V W X Y Z TH HE E* *C CA AC CH HE* ).

Thereafter, nothing novel happens during the final three characters, except that the compressor has to output the final reference number for the string it is matching at the time the input data runs out. The final output is ( 20 8 5 0 3 1 3 28 0 9 19 ) and the final dictionary is ( * A B C D E F G H I J K L M N O P Q R S T U V W X Y Z TH HE E* *C CA AC CH HE* *I IS ). The final dictionary is not needed for anything, so it can simply be discarded. All necessary information is contained in the output reference numbers.

In this example, LZW compressed twelve characters to get eleven reference numbers. This is not very good compression, especially considering that each reference number requires more bits of storage than a character. However, this is also a very short example. Realistic examples would be thousands of characters long. On long English ASCII text, LZW typically reduces the size of the data by at least half.

( 20 8 5 0 3 1 3 28 0 9 19 0 1 0 6 9 6 20 25 30 1 12 9 2 5 18 38 13 13 15 2 15 24 30 15 12 15 18 5 4 0 48 11 29 1 14 0 5 24 20 18 39 18 9 16 29 2 71 84 0 15 87 0 25 5 37 19 21 3 3 91 37 16 1 18 11 38 20 86 14 29 14 9 106 0 16 15 109 104 41 22 29 27 64 29 5 9 7 8 104 13 109 21 20 91 0 14 63 27 105 29 26 51 15 111 113 14 104 19 5 117 72 6 15 21 52 119 5 29 127 14 129 131 23 91 104 8 9 69 0 27 29 23 163 130 86 1 102 76 1 9 12 166 173 29 4 154 152 77 177 37 149 152 34 12 1 19 104 20 23 140 145 147 0 138 18 140 6 154 194 34 31 33 29 36 103 136 132 109 29 112 114 40 150 52 201 203 220 0 156 158 37 133 18 135 15 110 222 141 218 122 124 104 108 233 139 225 128 130 37 160 193 73 12 146 1 20 9 232 166 8 120 40 205 )

The cache originally contained various compression-themed materials, including:

  • Starbucks gift card (compressed alertness)
  • Squeeze smiley ball (compressed happiness)
  • California geocoin (compressed travelbug)
  • Ray Charles CDs (compressed music)

There is no requirement to maintain the theme when trading, but I'm sure you can think of something.

2/9/2006 — Congratulations to PurplePeople on the FTF. Thanks for providing a pen and an interesting bit of personal history.

2/10/2006 — No, that is not a mistake. 84 is correct. So is the other one. As I said, LZW is clever.

Additional Hints (Decrypt)

[Puzzle] Na rkgen evcr onanan nggenpgf ohtf.

Decryption Key

A|B|C|D|E|F|G|H|I|J|K|L|M
-------------------------
N|O|P|Q|R|S|T|U|V|W|X|Y|Z

(letter above equals below, and vice versa)