Optimal File Ordering in Tar Files

In January 2011, I needed to make a small update to a tarball I distribute, and noticed that despite minimal changes, the new version was almost 1 MB larger than the original 4 MB file.

A roughly 25% increase, even for something so small, was too much to pass by without investigation. What I discovered was that the order of files inside a tarball can make a pretty large difference to the compressed file size. I probably shouldn't have been surprised, really, given the nature of compression algorithms, and I'm certainly not the first person to notice this (here's a blog post about it, and here's a technical paper discussing the issue), but the size of the effect still caught me off guard.
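To make the effect concrete, here is a minimal synthetic sketch in Python. It is not a reconstruction of what happened to my tarball. It assumes a contrived worst case: five incompressible 200 kB payloads, each stored twice under made-up names (`orig_N.bin` and `copy_N.bin`). bzip2 compresses in independent blocks of at most 900 kB, so it can only exploit a duplicate if both copies land in the same block. The helper name `tar_bz2_size` is my own invention for this illustration.

```python
import bz2
import io
import os
import tarfile


def tar_bz2_size(members):
    """Return the bzip2-compressed size of a tar holding `members` in the given order.

    `members` is a list of (name, bytes) pairs; the archive is built in memory.
    """
    raw = io.BytesIO()
    with tarfile.open(fileobj=raw, mode="w") as tar:
        for name, data in members:
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    # bz2.compress defaults to level 9, i.e. 900 kB blocks.
    return len(bz2.compress(raw.getvalue()))


# Five incompressible 200 kB payloads, each stored twice (hypothetical names).
payloads = [os.urandom(200_000) for _ in range(5)]
originals = [(f"orig_{i}.bin", p) for i, p in enumerate(payloads)]
copies = [(f"copy_{i}.bin", p) for i, p in enumerate(payloads)]

# Ordering 1: every copy sits directly after its original.
adjacent = [m for pair in zip(originals, copies) for m in pair]
# Ordering 2: all originals, then all copies, so twins are about 1 MB apart.
separated = originals + copies

adjacent_size = tar_bz2_size(adjacent)
separated_size = tar_bz2_size(separated)
print("adjacent: ", adjacent_size, "bytes")
print("separated:", separated_size, "bytes")
```

Both archives hold exactly the same files. The exact numbers change from run to run because the payloads are random, but the adjacent ordering should come out noticeably smaller, since in the separated ordering no block ever contains both halves of a pair.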

Anyway, there's probably not much point to this page, really, but I felt it might be worth having another mention for Google to pick up on. In particular, I'd love to see an implementation of the algorithm mentioned in that paper. The original file I noticed this on, if you're interested in playing around with it yourself, is here: wizard_people.tbz2. In its current form it clocks in at 4057187 bytes, but if I untar and then re-tar it on my system, it grows to 5040770 bytes.
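If you do want to experiment, something like the following Python sketch lets you re-pack an archive in a chosen order without extracting it to disk. To be clear, this is not the algorithm from the paper. The default sort key here (group by file extension, then by path) is just a simple heuristic I'm assuming for illustration, and it may or may not help on any given archive. The names `repack` and `extension_then_name` are hypothetical.

```python
import os
import tarfile


def extension_then_name(member):
    """Hypothetical sort key: group by file extension, then by full path."""
    ext = os.path.splitext(member.name)[1].lower()
    return (ext, member.name)


def repack(src_path, dst_path, key=extension_then_name):
    """Copy a tar archive to a new bzip2-compressed tar, reordering regular files by `key`."""
    with tarfile.open(src_path, "r:*") as src, tarfile.open(dst_path, "w:bz2") as dst:
        members = src.getmembers()
        # Directories go first so parents exist before their contents on extraction.
        dirs = [m for m in members if m.isdir()]
        # Regular files are the only entries that get reordered.
        files = sorted((m for m in members if m.isfile()), key=key)
        # Links and other special entries go last, after any files they point to.
        rest = [m for m in members if not m.isdir() and not m.isfile()]
        for m in dirs + files + rest:
            dst.addfile(m, src.extractfile(m) if m.isfile() else None)
```

For example, `repack("wizard_people.tbz2", "reordered.tbz2")` would write a reordered copy (the output name is made up), and passing a different `key` function lets you compare other orderings by looking at the resulting file sizes.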