Backup to a distributed system
bofh at stremler.net
Sun Aug 11 14:06:43 PDT 2002
Third time -- I've tried responding to this twice so far, only
to have something happen and the response get lost. Arg.
begin quoting Tracy R Reed as of Thu, Aug 08, 2002 at 02:48:19PM -0700:
> I've been wondering just how such a thing might work and I've got an idea
> or two. The biggest problem is of course that you cannot compress data
> beyond a certain point without losing info.
The whole lossy-compressing question is pretty much a topic unto itself.
> So there is definitely a
> mathematically provable hard limit.
Well, sorta. It depends on the data. I can compress an infinite sequence
of random bytes into just one or two bytes, if I can choose the byte
sequence. (e.g. 'pi' or 'e') However, such solutions aren't really too
useful in practice.
> Another problem is that redundancy
> is the opposite of compression and the more redundancy you have the more
> you have defeated the compression.
...but the more you make it robust, hopefully. :)
> But if you compress and then make the
> compressed data redundant you still come out ahead by using less disk
> space to make your information redundant than you otherwise would without
> prior compression.
Not a given. It depends, significantly, on the data.
> Compression removes redundancy but of course that is
> useless redundancy. The kind of redundancy we want is the sort that allows
> us to reconstruct our data.
Not necessarily "of course". It also depends on whether you can handle a
little corruption or not. Randomly changing one byte in every 1k of
data across the entire system will trash any compressed-backup scheme,
but would merely be annoying to a distributed collection of text files.
Now, I agree that compression to remove needless redundancy can be a
good thing, if we assume that we are taking appropriate measures to
handle random corruption.
> A few months back I began investigating the Burrows-Wheeler Transform (
> http://www.dogma.net/markn/articles/bwt/bwt.htm ) which is the compression
> algorithm on which bzip2 is based. Very cool stuff and not hard to work
> out by hand for a very simple 8 byte example.
How many steps is that?
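For what it's worth, the hand method is just: write out every rotation of
the block, sort them, and keep the last column. A naive Python sketch of
that (illustration only -- real bzip2 works on ~900k blocks with a clever
suffix sort, not a brute-force rotation table):

def bwt(s, eof="\0"):
    """Naive BWT: sort all rotations of s+eof, return the last column."""
    s = s + eof
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def inverse_bwt(last_col, eof="\0"):
    """Invert by repeatedly prepending the last column and re-sorting."""
    table = [""] * len(last_col)
    for _ in range(len(last_col)):
        table = sorted(last_col[i] + table[i] for i in range(len(last_col)))
    return next(row for row in table if row.endswith(eof))[:-1]

if __name__ == "__main__":
    original = "banana"                          # the classic tiny example
    transformed = bwt(original)
    print(repr(transformed))                     # like characters clump together
    print(inverse_bwt(transformed) == original)  # True

Run it on 'banana' and you can watch the similar characters cluster in the
output, which is what makes the later RLE/Huffman stages pay off.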
> It is the most effective
> algorithm known but it is very CPU intensive.
So the tradeoff in question is time versus space.
> Among its many interesting
> properties is the fact that the more non-random data you throw at it the
> more effective the compression is.
Isn't that an essential feature of Huffman encoding?
> It doesn't do basic block compression.
I was under the impression that block compression gave some form of
corruption-resistance. That is, if you corrupt the tenth byte in block
12 of 15, you could recover the data from blocks 1-11 and 13-15.
Hm. According to my manpage, bzip2 is a "block-sorting file compressor".
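To make the corruption-resistance point concrete, here's a rough sketch of
compressing data as independent chunks (a made-up container, not the actual
.bz2 stream format): a single corrupted block costs you that block and
nothing else.

import bz2

BLOCK_SIZE = 900 * 1024     # bzip2's default block size is roughly 900k

def compress_blocks(data, block_size=BLOCK_SIZE):
    """Compress each fixed-size chunk of data independently."""
    return [bz2.compress(data[i:i + block_size])
            for i in range(0, len(data), block_size)]

def decompress_blocks(blocks):
    """Recover everything except whatever blocks got corrupted."""
    out = []
    for n, block in enumerate(blocks):
        try:
            out.append(bz2.decompress(block))
        except (OSError, ValueError):
            # a trashed block is skipped; the rest still comes back
            print("block %d is corrupt, skipping" % n)
    return b"".join(out)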
> If I compress my whole hard drive and 5 users on my system have the exact
> same copy of a file in their homedirs the files won't be compressed
> individually because they all fall into separate blocks (like with gzip)
> but will eventually be discovered to be the same by the algorithm and
> squeezed together, reducing the redundancy by a factor of 5, and then the
This assumes that tar is used first?
> remaining data is compressed with something like run length encoding and
> then Huffman encoding to remove the final bits of redundancy.
Everything comes down to Huffman, eventually, it seems. :)
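Since it all funnels down to Huffman in the end, here's a minimal
Huffman-code builder in Python, just to show the "frequent symbols get short
codes" idea (bzip2's real entropy coder juggles several Huffman tables and is
far more involved):

import heapq
from collections import Counter

def huffman_codes(data):
    """Build a prefix code: frequent bytes get short bit strings."""
    heap = [[count, i, {sym: ""}] for i, (sym, count) in
            enumerate(Counter(data).items())]
    heapq.heapify(heap)
    tie = len(heap)              # tie-breaker so dicts never get compared
    while len(heap) > 1:
        lo = heapq.heappop(heap)
        hi = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in lo[2].items()}
        merged.update({s: "1" + c for s, c in hi[2].items()})
        heapq.heappush(heap, [lo[0] + hi[0], tie, merged])
        tie += 1
    return heap[0][2]

if __name__ == "__main__":
    print(huffman_codes(b"aaaaabbbc"))   # 'a' (97) gets the shortest code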
> strings of data anywhere in the data are found and reduced. For english
> text or HTML or source code this results in huge savings.
ASCII is always a _wonderful_ compression subject. :)
> If a number of
> people all have the same taste in porn and have the same
> pics/mpegs/whatever it is all reduced down.
GIF, JPEGs, MPEGs, etc. are typically already compressed, so the only
saving would be for common bitstreams.
Didn't M$ do this with their compressing filesystems?
The problem, of course, is that you never quite know how much space
you will need to back everything up. The users could 'clean up' their
directories, and your backup space requirements could explode.
(e.g. your users take all of their .GIF porn and recompress it at
varying levels of quality into JPEGs. Now they're using half the disk
that they used to be using, but your backup space requirements have gone
up, because the recompressed copies are no longer identical to one another.)
> I've been wondering if it might be possible to build a cluster of machines
> which implement the Burrows-Wheeler Transform on huge amounts of data and
> then takes that reduced amount of data and adds some redundancy in the
> form of parity a la RAID 5 or something and distributing that parity to
> another machine. Very large block sizes could be used to reduce overhead.
How much parity will be needed?
It's been a long, long time since I looked at ECC...
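The RAID-5 answer to "how much parity" is cheap: one parity block per stripe,
which buys you recovery from exactly one lost block in that stripe.
(Surviving more simultaneous losses takes real erasure codes/ECC, which is
where my memory gives out too.) A toy sketch of the XOR trick:

def xor_blocks(blocks):
    """XOR equal-length blocks together byte by byte."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            parity[i] ^= b
    return bytes(parity)

if __name__ == "__main__":
    stripe = [b"AAAA", b"BBBB", b"CCCC"]   # data blocks on three machines
    parity = xor_blocks(stripe)            # parity block on a fourth machine
    # One machine dies; rebuild its block from the survivors plus parity.
    rebuilt = xor_blocks([stripe[0], stripe[2], parity])
    print(rebuilt == stripe[1])            # True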
> There are a few problems with this approach. The first is that
> Burrows-Wheeler might not be parallelizeable. I've been thinking it over
> and I'm not sure if it is or not. If it is it might involve quite a bit of
I rather suspect that the network I/O will be the limiting factor.
> bandwidth. The central part of Burrows-Wheeler is the sorting of strings.
> Are there parallel sort algorithms?
But sorting over network clusters is not likely to be efficient. Most
parallel sorting algorithms are for multi-processor shared-memory machines.
> If all of the data in the system were
> taken as one huge image (the best way to get compression) would it be
> possible to insert and retrieve individual pieces of data without
> expanding the whole thing?
> I think you would have to un-Huffman and
> un-RLL a block to insert your data but you might not have to un-transform
> the whole thing although you would have to resort it all again to get good
> compression and to resort it all you may well have to un-Huffman and
> un-RLL it but you could do that on the fly I suppose without having to
> expand it all at once. I haven't quite worked this part out yet.
Well, if you know what blocks a particular file belongs to, then you
really only need to decode that block; however, the book-keeping is
likely to become a nightmare.
> problem is that the data put into the system cannot be encrypted. You
> really can't compress encrypted data.
Is this because the encryption algorithms result in mostly-random data,
or do they employ compression as a part of the encryption process? (And
if they don't do the latter, then why not?)
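My understanding is that it's the former: good ciphertext is supposed to be
statistically indistinguishable from random bytes, and random bytes have no
redundancy left to squeeze out (which is also why the usual advice is to
compress first, then encrypt). A quick demonstration of the gap, with zlib
standing in for any compressor:

import os
import zlib

text = b"the quick brown fox jumps over the lazy dog " * 200
noise = os.urandom(len(text))          # stand-in for well-encrypted data

print(len(zlib.compress(text)) / len(text))    # well under 1.0
print(len(zlib.compress(noise)) / len(noise))  # slightly over 1.0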
> So this might not be so good for
> backing up data among non-trusted parties. It wouldn't be so good for
> MP3's either which won't compress well but it would be good for removing
> the redundancy of lots of popular identical MP3's.
So would a central streaming server. :)
> This might be best used for a company-wide backup solution where security
> isn't a problem and the data is mostly executables (lots of common
> executables), documents, source code, etc. Every PC in the company could
> set aside a gig or two of disk and run a client which would work on backup
> data processing in the background.
How would this be any better than a centralized file server and a lot
of caching networked filesystems? You'd have more disk available on each
machine for local duplication (because it's likely that most people
would only use a small fraction of the total software on their system
at any one time), as well as remote synchronization.
> Just some brainstorming I've been doing lately...
Most of the stuff in the previous replies had been about why compressed
data was so hard to compress further, and the fact that there are more
"possible data sets" that aren't compressible than those that are.
Remember, the purpose of backups is to trade space for reliability. So
a simple mirroring scheme among machines on a local network would do
the job quite well. It would, however, be expensive in space. (For best
results, you'd have to give up 2/3rds of your disk space to keep a
couple of copies of the remaining 1/3rd on other machines.)
However, your 'eliminate common files' approach could be adapted to a
common backup scheme to allow for greater redundancy AND less disk usage
without having to resort to complicated compression schemes.
Contemplate a simple-minded and wasteful (but very easy to recover)
double-mirror scheme among the machines A, B, C, and D.
A mirrors its disk to B and C.
B mirrors its disk to C and D.
C mirrors its disk to D and A.
D mirrors its disk to A and B.
Any two machines can crash, and the data can be recovered.
But it's wasteful. So we take your approach and notice that the same
set of application programs exists on A, B, C, and D. We then replace each
such file in the 'mirrored' portions of the disk with a checksum, an md5
hash, and a pointer to the local (installed) copy of that file.
So long as a change to a file updates the appropriate (remote and local)
entries, we can maintain consistency AND save a fair bit of space on
nearly-identical machines. You could even do the 'optimization' step
in idle processes, and the local change detection would be a copy-on-write
sort of functionality.
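To make that concrete, here's a rough Python sketch of the 'stub instead of
a second copy' idea. Everything here -- the stub format, the .stub naming,
the index layout -- is invented for illustration; a real tool would also
track permissions and mtimes, cope with hash collisions, and do the
update/consistency bookkeeping described above.

import hashlib
import json
import shutil
from pathlib import Path

def md5sum(path):
    """md5 of a file, read in 1 MB chunks."""
    h = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def mirror_with_dedup(src, mirror, local_index):
    """Copy the tree at src into mirror, but when an identical file already
    exists on this machine (local_index maps md5 -> local path), write a
    tiny stub pointing at the local copy instead of a full duplicate."""
    for f in src.rglob("*"):
        if not f.is_file():
            continue
        dest = mirror / f.relative_to(src)
        dest.parent.mkdir(parents=True, exist_ok=True)
        digest = md5sum(f)
        if digest in local_index:
            stub = {"md5": digest, "points_to": str(local_index[digest])}
            dest.with_name(dest.name + ".stub").write_text(json.dumps(stub))
        else:
            shutil.copy2(f, dest)

The local_index is what the idle-time 'optimization' pass would build: hash
everything already installed on the machine once, and after that every
incoming mirror file that matches an existing hash costs a few hundred bytes
of stub instead of a full copy.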
-Stewart "Third Time is a Charm" Stremler
 I've held, in my hands, a book of "random numbers". Left-hand pages
were the digits of an expansion of e, and right-hand pages were an
expansion of the digits of pi.
 Because it's cheating. One way to measure randomness of data is to
attempt to compress it -- the more it compresses, the less randomness it
contains.
 It wouldn't hurt, still, to do routine backups....