UCLA ACM's First Programming Competition: File Anti-Compression
Problem
Write a file encoder whose output, when sent through gzip,
is compressed by as little as possible.
(The full UCLA ACM description of this problem, problem #2, is no longer on the web.)
Results
The results announcement from the contest organizers:
Date: Tue, 16 Nov 2004 01:02:02 -0800
From: Adam Harmetz <aharmetz@gmail.com>
Subject: MPC Results!

Thank you all for participating in ACM's Monthly Programming Contest. I have graded the results for the two problems this month.

Problem 1 was open only to first and second years. There were two winners who correctly identified the longest palindrome in the text of Shakespeare's Hamlet. The winners were Shant Hovsepian and the two-person team of Brian Chang and Grace Shih. All three have won a prize package. In addition, Brian and Grace's solution was incredibly efficient and effectively used object-oriented programming to create easy-to-read code. They deserve special accolades and won the style points for problem 1.

There was great diversity within the answer set for problem 2. Perl, C++, Java, and shell scripting were all employed to try to beat gzip. Two themes emerged as "most successful" solutions; these solutions actually expanded the file to be longer than the original when "compressed."

The first solution used a random number generator (or an array of random numbers hardcoded into the source) to XOR the bits of the original file. Since XOR is symmetric and random number generators will produce the same values if given the same seed, this solution had the interesting property that the encoder and decoder were the same program.

The second solution was clever in its own way: it employed gzip itself to compress the original file. To ensure that the encoded file was the same length as the original, two different strategies were used. One used random data to pad the end of the file to the correct length. Another was to repeat copies of the gzipped data until the correct length was reached. Participants either stored the length somewhere in the file, added a special character at the end of the gzipped data, or took advantage of the fact that gzip will ignore trailing garbage. Either solution yielded the same expansion of the original file when gzipped.

Chris Frost, Steve VanDeBogart, Wing-kai Chan, Adam Vartanian, Nathan Plumley, and Sean Soria were all able to implement one of the above optimal strategies. Other participants either employed strategies that resulted in compression (non-random choice of digits to XOR) or didn't fully take advantage of the various random number generators of Java and Python (too small a subset of random numbers yielded suboptimal results).

Chris Frost, in addition to providing a wonderful narrative on gzip and an analysis of what an optimal solution looks like, also employed a clever trick to beat the rest of the participants. Since gzip stores the name of the file you are compressing inside the compressed file, the longer the encoded file's name, the larger the compressed file. Chris chose the maximum filename length for his encoded output file. This showed a healthy knowledge of the inner workings of gzip and earns him the most significant prize package (wireless keyboard and mouse, Office XP Professional 2003, Visual Studio .NET, and Halo) as well as special accolades on this contest.

Out of the remaining five high achievers on the contest, Nathan and Sean were randomly selected to receive a wireless keyboard from Microsoft. For the rest, I have copies of Halo for you.

Thanks to everyone who participated. I'm really happy to see the enthusiastic response to this ACM program and a healthy desire to experiment with programming.
Google and Real Networks will both be involved in designing future challenges, and the Engineering Alumni Association came through with significant funding for future contests. Expect another contest after Thanksgiving to carry us through the holiday season. Thanks, of course, to Dr. Eggert for helping design problem 2 and to Microsoft for their generous support of the program.

Please send me your resume if you would like to be included in the special packet being sent to Microsoft (it includes only those who participated in the contest). To pick up your prize, you are welcome to drop by my apartment if you are itching to get it soon (I live on Kelton). I will set up an "office hour" next week where you can drop by and pick them up from me on campus.

As a final note, I'd like to let everyone know about the next major upcoming competition: Imagine Cup. Sponsored by Microsoft, it features a number of cool challenges and puzzles. Anyone who participated in our contest should find competing in Imagine Cup right up their alley. Email me if you want more info, or visit imagecup.com.

You can email me if you would like more info on how your submission was graded or other questions.

Take care,
Adam
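The XOR approach the email describes is easy to picture. A minimal sketch in Python (the seed value, byte-at-a-time PRNG use, and command-line handling are illustrative guesses, not any contestant's actual code):

    import random
    import sys

    SEED = 20041103  # hypothetical constant; encoder and decoder must agree on it

    def xor_stream(data: bytes) -> bytes:
        # Seeding the PRNG identically on every run reproduces the same byte
        # stream, and XOR is its own inverse, so this one function both
        # encodes and decodes: the encoder and decoder are the same program.
        rng = random.Random(SEED)
        return bytes(b ^ rng.randrange(256) for b in data)

    if __name__ == "__main__":
        with open(sys.argv[1], "rb") as src, open(sys.argv[2], "wb") as dst:
            dst.write(xor_stream(src.read()))  # output length equals input length

Because the output bytes look uniformly random, gzip finds essentially nothing to compress, and its fixed header plus per-block overhead leave the .gz file slightly larger than the input.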
Solution
And if you're interested, my solution's description/analysis:
ACM Monthly Programming Contest
November 3rd, Second Problem
Chris Frost <frost@cs.ucla.edu>

** Building and Requirements

GNU make will build frontend. This program requires some bits and pieces of POSIX (e.g. system()), plus gzip, gunzip, cp, cat, echo, and touch.

** Usage

Running frontend will output a help summary.

To encode a file: frontend -e <input_filename>
To decode a file: frontend -d <encoded_filename> <output_filename>

Note: the encoded filename should not be changed. frontend will attempt to detect any important modifications, but it does not detect all changes.

** Algorithm

The goal of this program is to encode a given file so that when the encoded file is given to gzip, it is larger than any other person's entry :). Two primary questions come to mind:

- How can I do this?
- Is there a maximum size?

My two obvious answers to the first are to produce what looks like random information (you can't compress randomness well) or to feed gzip its own output. While producing random noise would make decoding difficult, cryptography has long studied how to make information have random properties. Were the compressor in this contest unknown, I think I would take the route of encryption; it would probably hold up better in general. However, knowing gzip is the compressor, I believe we can get a larger file using gzip itself.

As to whether there is a maximum, at which point we can stop trying: yes. zlib, used by gzip, detects when "compressed" information would be larger than its original size. In this case, the block (32KB) is stored without any compression, along with five bytes of header information. There is an additional fixed overhead for the file header, plus the variable-sized filename in the file header. Thus, the maximum we can make gzip expand a file is 5B per 32KB block, plus ~50B for the fixed header, plus the encoded filename. (For example, a 165KB file spans six 32KB blocks, so the block overhead comes to about 30B.) This algorithm gets as close to the 5B / 32KB limit as my measuring has allowed me to inspect. (I am sure you could inch more space out, however.) It also takes advantage of the filename encoding as far as the filesystem will allow, enlarging the gzipped file by roughly another 2^8 B. (This is OS dependent.) My primary reference for gzip has been http://www.gzip.org/zlib/zlib_tech.html, as well as their RFCs.

*** Achieving 5B / 32KB Inflation

Our goal is 5B / 32KB; this is how we try to get there. After a bit of experimentation, I found that running gzip on gzip output reaches a local maximum fairly quickly, typically in fewer than four rounds. What we want is for the judge's gzip run to inflate the file by as much as possible, so we should compress the file one less time than that. However, things aren't quite this simple, because the encoded file must have the same length as the input file. Gzip has one sweet property we take advantage of: it detects non-gzip data at the end of a file and ignores it.

Our approach is thus this: gzip the original file a few times to get one step away from the local inflation maximum, write a little non-gzip data, then write hard-to-compress data until the file is long enough. Determining the number of gzip runs on a per-file basis would complicate the decoding stage, so we always gzip the input file three times (what a little testing showed to work well for a couple of files). We then write a few-byte string as the non-gzip "magic" data. We then concatenate copies of the compressed input file onto themselves until the compressed input + magic + this footer has a size at least that of the input. The footer is then compressed twice (this turns out to help a fair bit), all three parts are merged into one file, and this single file is truncated to the length of the original file. Finally, we output the encoded file.
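If it helps to see that pipeline in one place, here is a rough sketch in Python rather than the real C frontend (the magic marker, the fixed round counts, and the final top-up loop are illustrative; the real frontend also handles cases this ignores, such as inputs that themselves inflate):

    import gzip

    MAGIC = b"\x00NOTGZIP\x00"  # illustrative non-gzip marker, not frontend's actual bytes

    def gzip_n(data: bytes, times: int) -> bytes:
        # Repeated gzipping reaches its local inflation maximum within a few
        # rounds. mtime=0 keeps the timestamp out of the header: header bytes
        # are stored uncompressed, so they waste space better spent on blocks.
        for _ in range(times):
            data = gzip.compress(data, mtime=0)
        return data

    def encode(data: bytes) -> bytes:
        body = gzip_n(data, 3)                 # one round short of the judge's gzip run
        footer = body
        while len(body) + len(MAGIC) + len(footer) < len(data):
            footer += body                     # repeat compressed copies until long enough
        footer = gzip_n(footer, 2)
        out = body + MAGIC + footer
        while len(out) < len(data):            # top up if compressing the footer shrank it
            out += footer
        return out[:len(data)]                 # encoded file must equal the input's length

(Python's gzip module happens never to store a filename, which matches the encode-side goal, noted below, of keeping header bytes to a minimum.)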
However, gzip stores the compressed file's name inside the compressed file. We can take advantage of this by giving the encoded file a long filename, which increases the size of the judge's gzip output by that much more. How long a filename can be is OS dependent; on Linux with ext2, filenames are limited to 255 characters. (There are C constants giving maximum filename lengths, but they are often much longer than what the OS's filesystem actually supports, so we determine the limit experimentally at runtime.)

A couple of snags:

- Exactly as we are doing to gzip, we could be given a file that we could inflate. We detect this in all cases and fall back to "encoding" the input by copying it. We communicate this fact to the decoder through the filename.
- We assume that the data we will be dealing with is larger than around 32KB. If this assumption fails, our copies will fit within gzip's block size and the judge's gzip run will compress our encoding. I'm assuming/hoping this won't be the case, as I think people are more likely to take the mindset that large files are difficult, and the problem statement says "a rather long text file" will be used.

A side note: one interesting thing I found is that we actually want to *avoid* storing as much header information as possible in the encode phase! Headers are not compressed, so every byte in a header is a byte not helping push the number of 32KB blocks higher. Thus the encode phase tells gzip not to store original filenames and timestamps. On a 165KB file, this makes the judge's gzipped file perhaps 2B larger. :)

encode() contains more detailed documentation on the encoding.

*** Decoding

decode() explains this pretty straightforwardly; we just reverse the operations of encode(). This turns out to be simpler, since gzip throws the footer and magic away for us and we never deal with them.
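In the same illustrative Python terms as the sketch above, decoding just peels the three gzip layers back off, relying on the decompressor stopping at the end of each gzip stream:

    import zlib

    def gunzip_ignore_trailing(data: bytes) -> bytes:
        # Like gzip(1), stop at the end of the gzip stream and ignore whatever
        # follows it (the magic marker, the footer, and any truncation damage).
        d = zlib.decompressobj(wbits=31)       # wbits=31 selects the gzip wrapper
        return d.decompress(data)

    def decode(data: bytes) -> bytes:
        for _ in range(3):                     # undo the three encode-side gzip rounds
            data = gunzip_ignore_trailing(data)
        return data

Under these assumptions the round trip holds, i.e. decode(encode(data)) == data, for any input comfortably larger than 32KB.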
*** Room for Improvement

- Be able to analytically calculate how many times to gzip the data, and allow this number to vary per file.
- With the above, we could run gzip on what is now the encoded file. From experiments, running gzip on a concatenated file pushes the judge's file size a little higher.
- Support file encoding even when the input can be compressed and concatenated a few times, but not enough for the full process.
- Perhaps use zlib rather than gzip, which I believe stores less header data.

** Problem

Summary: Write an encoder and decoder such that, given an arbitrary text file, the encoder's output (which must be the same size as the original) is compressed the least by the standard gzip utility. You must be able to use the decoder on the encoded file to get back the exact original.

Eligibility: All UCLA students

Due date: November 3rd

Further description: To explain this in more detail, here is how we will test your encoder/decoder. First, we will choose a rather long text file; call it filea.txt. Next, we will run your encoder on this text file. The result should be a separate file that is the exact same length as the original; call it fileaencoded.txt. You may produce fileaencoded.txt in any manner you see fit. However, we must be able to run your decoder on fileaencoded.txt to produce filea.txt again exactly (we will use diff to check). To determine the winner of the contest, we will run gzip on fileaencoded.txt to produce fileaencoded.gz. Whichever competitor's fileaencoded.gz is the largest will be the winner of the competition. Note that we may run this process on several different files and take the weighted average file size of the trials.

What to turn in: You must turn in the source of your encoder and decoder, written in the language of your choice. Both files must be able to compile (if necessary) and run on a SEASnet machine. Please also turn in a README file if appropriate. Email your files to ucla.acm at gmail.com.

What determines the winner: Whoever has the largest average file size across all our test cases will win. If there are multiple entries that fit this criterion, all winners will be entered in a random drawing for available prizes. In addition, "style points" will be awarded for clever, efficient, or otherwise unique answers. These contestants will receive additional prizes.
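For concreteness, the grading procedure described above amounts to something like the following sketch, with placeholder program names and file paths, and with Python's gzip module standing in for the gzip utility the graders actually ran (which, unlike the module, also stores the output filename, the detail the winning trick exploited):

    import gzip
    import os
    import subprocess

    ORIGINAL = "filea.txt"           # placeholder test input
    ENCODED = "fileaencoded.txt"     # contestant's encoder must preserve the size

    subprocess.run(["./encode", ORIGINAL, ENCODED], check=True)
    assert os.path.getsize(ENCODED) == os.path.getsize(ORIGINAL)

    with open(ENCODED, "rb") as f:
        print("score (gzipped size):", len(gzip.compress(f.read())))  # larger wins

    subprocess.run(["./decode", ENCODED, "roundtrip.txt"], check=True)
    subprocess.run(["diff", ORIGINAL, "roundtrip.txt"], check=True)   # exact match required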