How To Build Corruption Proof Files

Back in the days of floppy disks, corruption was a very big issue. You might copy your school paper to a 3.5" floppy disk, put it in your backpack, take it to school, and find that your paper has been erased or is unreadable. With the advent of flash drives and the miniaturization of hard disk drives, this is much less likely to happen. Still, files do become corrupt or altered without you knowing it.

The Situation

Most text-based file formats (plain text, HTML, .doc, or anything you can somewhat read in Notepad) are usually easy to recover in the case of file damage, because the undamaged parts remain readable. On the other hand, non-text-based file formats (.exe, .rar/.tar/.zip, .dat, or anything that is gibberish in Notepad) are often impossible to recover if the file is damaged.

Consider the following situation: You have an external hard drive with somewhat important files (we'll say 3-year-old backups) on it, and your drive is getting full. It's not worth your money to go out and buy a bigger external drive for files you hardly ever access, but you need to add some more files. So what do you do? You compress the files. Now the problem with file compression is that once you have compressed 4 GB of files into, say, 1 GB, you're stuck with a single 1 GB file. What happens if any part of that file gets damaged? You lose all your data (in some cases). Depending on where in the archive the damage occurred, you could be stuck with anywhere from minimal to complete data loss. You don't want to take that chance. Some might suggest splitting the archive up into archive.rar.001 to archive.rar.010, effectively making ten 100 MB chunks. This doesn't really help, because all the decompression application does is read these files end-to-end, which makes them the same as a single 1 GB file.

You've gained space, but traded away reliability to get it. If you had 1000 different uncompressed files and the drive were to get damaged, depending on the severity, you might only lose 1-2 files, probably not even the ones you need. Now if the drive gets damaged, you have one large file, and the damage could destroy all 1000 compressed files contained in the archive. So, how do you deal with this issue? Parity checking.

How It Works

Parity is a method of error correction that allows you to rebuild corrupt data at the byte level. Perhaps the most popular implementation of parity checking is in certain types of RAID arrays. Let's assume you have 3 pieces of data and a parity block:

A1 = 00000111
A2 = 00000101
A3 = 00000000
Ap = 00000010 (parity block generated by: A1 XOR A2 XOR A3)
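
The parity block above can be checked with a few lines of Python. This is only a minimal sketch; the variable names simply mirror the labels in the example and are not part of any particular tool.

```python
# The three data bytes from the example above, written as binary literals.
a1 = 0b00000111
a2 = 0b00000101
a3 = 0b00000000

# XOR (^) works on each bit position separately: a result bit is 1 only
# when an odd number of the input bits in that position are 1.
ap = a1 ^ a2 ^ a3

# Format the result as an 8-digit binary string for comparison with Ap.
parity_bits = format(ap, "08b")
print(parity_bits)  # 00000010
```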

To get a feel for this, think of three numbers and their sum (though this is not exactly what XORing does). Let's say A1=1, A2=2, A3=3 and Ap=6. You form your parity value (Ap) after you know the data, so A1+A2+A3=Ap. Now, imagine you somehow lost A2. You would have A1+X+A3=Ap, which equates to 1+X+3=6. Find X. X=2, so you can recover any one missing value as long as you know the total. You could do the same for A1, A3, or Ap. The catch is, if you lose two pieces, your data is unrecoverable: X+Y+3=6. How do you know that X=1 and Y=2 instead of X=-100 and Y=103?

This is similar to how XOR works. XORing two bits gives 1 if they differ and 0 if they match, so XORing a series of bits gives 1 exactly when an odd number of them are 1; the order does not matter. Ap = A1 XOR A2 XOR A3. If you lose A2, you can do "A2 = A1 XOR A3 XOR Ap" and regenerate it. In RAID 5 you can lose an entire 100 GB disk and still have your array running. Just replace the bad drive, tell your RAID controller to regenerate the bad disk, and you're back up and running. The problem is that if Disk1 is offline and Disk2 goes bad before you can replace it, all your data is lost.
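
The same recovery can be sketched on whole blocks of bytes rather than single values. The helper name `xor_blocks` and the sample block contents below are made up for illustration; this is not how any particular RAID controller or parity tool is implemented, just the XOR idea applied byte by byte.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length byte blocks together, position by position."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple per byte position across all blocks.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


# Three made-up 8-byte data blocks standing in for A1, A2 and A3.
a1 = b"school p"
a2 = b"aper dra"
a3 = b"ft no. 3"

# Build the parity block while all the data is still intact.
ap = xor_blocks(a1, a2, a3)

# Pretend A2 is lost: XOR the surviving blocks with the parity block.
rebuilt_a2 = xor_blocks(a1, a3, ap)
print(rebuilt_a2 == a2)  # True
```

Any single block, including the parity block itself, can be rebuilt this way from the other three. With two blocks missing there is no longer enough information, which matches the two-disk failure described above.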

File Parity Checking

So, back to our original problem: you have a huge compressed file that is at risk of getting damaged.

construction/how_to_build_corruption_proof_files.txt · Last modified: 2011/10/14 10:35 by 127.0.0.1