Backup to a distributed system
JD Runyan
jrunyan.lists at dms.nwcg.gov
Thu Aug 22 08:58:02 PDT 2002
I think the September issue of DDJ has an article on a distributed
filesystem for multiple platforms. You might check it out.
On Thu, 2002-08-22 at 10:49, Bob La Quey wrote:
> Ok, I just found this, an outgrowth of MojoNation.
>
> It has all of the features that we were discussing below.
> Unfortunately it is a proprietary product, at least for now.
>
> HiveCache is a distributed backup product that gives enterprise
> users the convenience of an online backup system without the
> costs of a backup service provider. Agent software running on
> each enterprise PC links the systems together into an adaptive
> online backup network, turning your unused disk space into an
> additional component of your IT infrastructure.
>
> At 04:02 PM 8/8/2002 -0700, you wrote:
> >At 02:48 PM 8/8/2002 -0700, you wrote:
> >
> >>I've been wondering just how such a thing might work and I've got an idea
> >>or two. The biggest problem is of course that you cannot compress data
> >>beyond a certain point without losing info. So there is definitely a
> >>mathematically provable hard limit. Another problem is that redundancy
> >>is the opposite of compression and the more redundancy you have the more
> >>you have defeated the compression. But if you compress and then make the
> >>compressed data redundant you still come out ahead by using less disk
> >>space to make your information redundant than you otherwise would without
> >>prior compression. Compression removes redundancy but of course that is
> >>useless redundancy. The kind of redundancy we want is the sort that allows
> >>us to reconstruct our data.
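
The compress-then-replicate argument above can be checked with a few lines of Python. This is only a sketch: zlib stands in for whatever compressor would really be used, and the sample data and the copy count of 3 are made-up assumptions, not figures from this thread.

```python
import zlib

def stored_bytes(data: bytes, copies: int, compress: bool) -> int:
    """Total bytes on disk when `copies` replicas of `data` are kept."""
    # Compress once, then replicate the (hopefully smaller) payload.
    payload = zlib.compress(data, 9) if compress else data
    return copies * len(payload)

# Hypothetical sample: repetitive text standing in for typical documents.
sample = b"The quick brown fox jumps over the lazy dog.\n" * 2000

raw = stored_bytes(sample, copies=3, compress=False)
packed = stored_bytes(sample, copies=3, compress=True)
print("3 raw copies:       ", raw, "bytes")
print("3 compressed copies:", packed, "bytes")
```

Data that is already compressed (MP3s, for instance) would show little or no gain, so the saving depends entirely on what is being backed up.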
> >
> >There is a very highly developed theory for such purposes. It has been used
> >extensively in satellite communications systems. Essentially one adds bits
> >to the data stream according to carefully constructed rules in just such a
> >way as to make it possible to recover from the types of noise (e.g. block
> >dropouts) that one expects in the particular environment.
> >
> >Here are a couple of examples
> >http://it.sns.ornl.gov/asd/public/pdf/sns0045/sns0045.pdf
> >
> > From a course on Coding Theory
> ><quote>
> >Error-correcting codes are ubiquitous in communications systems. They
> >allow engineers to scrimp on other design variables like power utilization.
> >The resulting degradation in the performance of the system is counterbalanced
> >by the error-correcting code. Briefly, the code works by building in a bit of
> >redundancy at the transmitting end and then using that redundancy at the
> >receiving end to identify where errors occurred in transmission and correct
> >them.
> ></quote>
> >
> >I note that the "optimal" code is very dependent upon the type of error
> >that one is coding against.
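
As a concrete instance of "building in a bit of redundancy", here is a minimal sketch of the classic Hamming(7,4) code in Python. It is chosen only because it is small; nothing in this thread says it is what any of the products mentioned use. It corrects one flipped bit per 7-bit codeword, which suits isolated bit errors but not block dropouts, and that illustrates the point that the right code depends on the error type.

```python
def hamming74_encode(d):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4          # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4          # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(c):
    """Correct at most one flipped bit, then return the 4 data bits."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 + 2 * s2 + 4 * s3  # 1-based position of the bad bit, 0 if none
    if pos:
        c[pos - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]

word = [1, 0, 1, 1]
sent = hamming74_encode(word)
received = list(sent)
received[4] ^= 1               # simulate one bit flipped in transit
print(hamming74_decode(received) == word)
```

The cost is 3 extra bits for every 4 data bits. Two flipped bits in the same codeword would be miscorrected, so this code is a poor match for bursty errors.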
> >
> >Further, these codes are already used at a lower level in hard disks.
> >
> >One needs to model the storage network, look at the types of errors it
> >will induce, and then develop a coding strategy for laying the data
> >down on the resulting "noisy" network.
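
One very simple model of such a network, offered only as a sketch: assume each piece of a file sits on a different PC, each PC is reachable independently with probability p, and the file can be rebuilt from any k of its n pieces. The function name and the numbers below are hypothetical. Real desktops are not independent (everyone powers down at night), so treat this as a first approximation.

```python
from math import comb

def recovery_probability(n: int, k: int, p: float) -> float:
    """Chance that at least k of n pieces are reachable.

    Assumes each piece is on a separate PC that is online
    independently with probability p (a simplifying assumption).
    """
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

p = 0.9  # hypothetical chance that a given PC is online
print("3 full replicas (3x space):  ", recovery_probability(3, 1, p))
print("any 8 of 12 pieces (1.5x):   ", recovery_probability(12, 8, p))
```

Comparing schemes this way separates the two questions in the paragraph above: what the failure model is, and how much extra space a given code spends to survive it.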
>
> From their FAQ http://www.hivecache.com/FAQ.html
>
> The HiveCache backup architecture uses a special encoding process
> when data is sent out to the peers for storage. It breaks a file up into
> lots of smaller pieces and adds error correction information to each
> piece to build a distributed RAID network. If individual PCs in the
> HiveCache backup network are turned off, crash, or are otherwise
> offline the remaining agents can use this extra error correction information
> as well as replicated data to reconstruct the original file from the pieces
> that are still available. HiveCache operates like a colony of bees: even
> if you swat a few of the bees, the swarm will continue to gather pollen for
> the hive.
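
The FAQ does not say which encoding HiveCache uses. As a rough illustration of the "distributed RAID" idea only, the sketch below splits a file into k pieces plus one XOR parity piece, in the style of RAID 4/5, so that any single missing piece can be rebuilt from the rest. The function names are made up for this example.

```python
from functools import reduce

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def split_with_parity(data: bytes, k: int):
    """Split data into k equal pieces (zero-padded) plus one XOR parity piece."""
    size = -(-len(data) // k)                  # ceiling division
    padded = data.ljust(size * k, b"\0")
    pieces = [padded[i * size:(i + 1) * size] for i in range(k)]
    parity = reduce(xor_bytes, pieces)
    return pieces + [parity], len(data)

def reconstruct(pieces, original_len: int) -> bytes:
    """Rebuild the data when at most one piece is missing (marked None)."""
    missing = [i for i, p in enumerate(pieces) if p is None]
    if len(missing) > 1:
        raise ValueError("single parity can repair only one missing piece")
    pieces = list(pieces)
    if missing:
        present = [p for p in pieces if p is not None]
        # XOR of all surviving pieces equals the missing one.
        pieces[missing[0]] = reduce(xor_bytes, present)
    return b"".join(pieces[:-1])[:original_len]

data = b"payroll.xls contents, stored across five peers"
pieces, length = split_with_parity(data, 4)
pieces[2] = None                               # one peer is switched off
print(reconstruct(pieces, length) == data)
```

Surviving several offline peers at once needs a stronger erasure code (Reed-Solomon is the usual choice) or extra replicas, which is presumably why the FAQ mentions both error correction and replicated data.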
>
> [snip of older message]
>
> >>So this might not be so good for
> >>backing up data among non-trusted parties. It wouldn't be so good for
> >>MP3s either, which won't compress well, but it would be good for removing
> >>the redundancy of lots of popular identical MP3s.
> >
> >Well MP3s are already highly compressed. But removing identical files
> >seems sensible.
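
Removing identical files is usually done by keying stored data on a hash of its contents. The sketch below is a hypothetical in-memory version; the class name, the use of SHA-256, and the (owner, filename) index are all assumptions made for illustration.

```python
import hashlib

class DedupStore:
    """Toy content-addressed store: identical files are kept only once."""

    def __init__(self):
        self.blobs = {}   # content digest -> file contents
        self.index = {}   # (owner, filename) -> content digest

    def put(self, owner: str, name: str, content: bytes) -> bool:
        """Record a file; return True only if its contents were new."""
        digest = hashlib.sha256(content).hexdigest()
        is_new = digest not in self.blobs
        if is_new:
            self.blobs[digest] = content
        self.index[(owner, name)] = digest
        return is_new

    def get(self, owner: str, name: str) -> bytes:
        """Fetch a file's contents by owner and name."""
        return self.blobs[self.index[(owner, name)]]

store = DedupStore()
song = b"pretend these are MP3 bytes"
print(store.put("alice", "hit.mp3", song))       # first copy is stored
print(store.put("bob", "same_song.mp3", song))   # duplicate adds only an index entry
print(len(store.blobs))
```

One caveat that ties back to the trust concern above: if each user encrypts files with a private key before upload, identical files no longer hash the same and the deduplication is lost.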
> >
> >
> >>This might be best used for a company-wide backup solution where security
> >>isn't a problem and the data is mostly executables (lots of common
> >>executables), documents, source code, etc. Every PC in the company could
> >>set aside a gig or two of disk and run a client which would work on backup
> >>data processing in the background.
> >
> >That is exactly the scenario I envision.
>
> More from the FAQ http://www.hivecache.com/FAQ.html
>
> I checked the disk space in use on the 100 PCs in our department and
> they have an average of 5 GB of programs and documents (500 GB total),
> how can you back up that much data if each PC only contributes 1 GB of
> space (100 GB total)?
>
> There is actually very little unique information on each enterprise PC. We
> all share the same sequences of bits that are common applications
> (e.g. word.exe), operating system files, and standard utilities. The only
> data that is really unique to these systems is the data that we input
> ourselves or gather from the net. Within a company or workgroup a lot
> of this data is replicated as documents and worksheets that get sent
> to peers as email attachments or copied around the LAN. Preferences
> and settings, personal documents, and work in progress represent most
> of this unique data, and it is not a large percentage of the total enterprise
> data pool. Applications and operating system files that are found on
> every PC make up the bulk of this enterprise data, and so a significant
> amount of storage space can be saved by only keeping around enough
> replicas of these common files to ensure reliable recovery.
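
The FAQ gives only the totals (100 PCs, 5 GB used on each, 1 GB contributed by each). To see how the arithmetic could work out, the sketch below plugs in made-up values for everything else: the split between unique and common data and the number of copies kept of each are assumptions, not numbers from HiveCache.

```python
def pool_needed_gb(pcs, unique_gb_per_pc, common_gb, unique_copies, common_copies):
    """Space needed when common files are stored once per replica set
    rather than once per PC."""
    unique_total = pcs * unique_gb_per_pc * unique_copies
    common_total = common_gb * common_copies
    return unique_total + common_total

# Assumed split of each PC's 5 GB: 0.25 GB unique, 4.75 GB shared by all.
needed = pool_needed_gb(pcs=100, unique_gb_per_pc=0.25, common_gb=4.75,
                        unique_copies=3, common_copies=5)
available = 100 * 1.0   # each PC contributes 1 GB
print(needed, "GB needed of", available, "GB available")
```

Under these assumptions the pool just fits. The result is very sensitive to the unique fraction: at 0.5 GB of unique data per PC, three copies alone would need 150 GB.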
>
> Microsoft is exploring this as well.
>
> http://research.microsoft.com/sn/Farsite/publications.htm
> http://research.microsoft.com/sn/Farsite/faq.htm
>
> From the Microsoft FAQ:
> =======
> What are you building?
> We are building a symbiotic, serverless, distributed file system. As a
> file system, its purpose is for storing files. It's distributed in that
> it runs on multiple machines, not just a single machine. It's serverless,
> meaning that it does not make use of a central server or a cluster of
> servers; it runs entirely on client machines. And it's symbiotic, meaning
> that it works among cooperating but not completely trusting clients.
> Logically, the system functions as a central file server, but physically,
> there is no central server machine. Instead, a group of desktop client
> computers collaboratively establish a virtual file server that can be
> accessed by any of the clients.
>
> There is also a lot more going on.
>
> http://dmoz.org/Computers/Software/Internet/Clients/File_Sharing/
>
>
>
> --
> http://www.kernel-panic.org
> list archives http://www.ultraviolet.org
> To unsubscribe, send a message to the address shown in the list-unsubscribe
> header of this message.
>
--
Jason D. Runyan
USDA NITC KC
Mid-Range Systems