Project Babel (The Library of Babel)
Objective:

Compression/Decompression System :: To provide a proof-of-concept platform for the re-hydration of data files from cryptographic hashes; a platform capable of massively compressing data in a way that will scale with the processor enhancements due within the decade, and that can be distributed across many computers in the meantime.
Project output should include C code and instructional operating documentation (if necessary).
Description:

In March 2009 I built an online hashing database to help a friend sort through thousands of files. The solution was not unlike Tripwire's (local) or Sun's (remote) file hash databases. The client program measured a file's size, its MD5 hash and its SHA1 hash, asked for some metadata, and sent that bundle of information to the server. At any time in the future, someone with a nondescript file could run the client program and submit a query with the file size, MD5 hash and SHA1 hash to the database to receive any available metadata on the target file. In this way, a user could learn about a file even if he had no means of discovering those details directly from the file itself.
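The record each client submitted would have amounted to something like the following; this is a hypothetical layout for illustration only, not the original database's actual schema:

    /* Hypothetical shape of the record the client submitted; field names
       and sizes are illustrative, not the original database's schema. */
    struct file_record {
        unsigned long long size;          /* file length in bytes        */
        unsigned char      md5[16];       /* MD5 digest of the contents  */
        unsigned char      sha1[20];      /* SHA1 digest of the contents */
        char               metadata[256]; /* free-form description       */
    };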
In late April 2009, it dawned on me that the uniqueness I had exploited in the vector of the file's size, its MD5 hash and its SHA1 hash could be used to completely recreate the contents of the original file.
The method is simple: create a file of the target's length and set the value of every byte to zero. Now calculate the MD5 hash of the file; is it the same as the MD5 hash of the target? If not, increment the file's contents by one (treating the whole file as a binary counter, starting at the least significant byte) and calculate the new MD5 hash. Is this the same as the MD5 hash of the target? No? Continue to increment and evaluate the MD5 hash until you have confirmed a hit. Now calculate the SHA1 hash of the file. Is this the same as the SHA1 hash of the target? If not, then you have found one of the collision domains of the MD5 hash, and you need to return to the increment-and-compare loop. Once both the MD5 and SHA1 hashes match their respective target hashes, you have found the collision domain of the two functions at this file size, and should subsequently have recovered your missing file contents.
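A minimal sketch of that search loop, assuming OpenSSL's one-shot MD5() and SHA1() helpers; the function and parameter names (rehydrate_block, increment, the *_target arguments) are illustrative placeholders rather than Babel's actual interface:

    /* Minimal sketch of the brute-force search described above, assuming
       OpenSSL's one-shot MD5() and SHA1() helpers. All names here are
       illustrative placeholders rather than Babel's actual interface. */
    #include <string.h>
    #include <openssl/md5.h>
    #include <openssl/sha.h>

    /* Treat buf as one large binary counter and add one, starting at the
       least significant byte. Returns 1 once the counter wraps back to
       all zeroes, i.e. every possible value has been tried. */
    static int increment(unsigned char *buf, size_t len)
    {
        for (size_t i = 0; i < len; i++)
            if (++buf[i] != 0)
                return 0;   /* no carry needed, keep searching */
        return 1;           /* carried off the end: key space exhausted */
    }

    /* Search the key space of a len-byte block for contents whose MD5 and
       SHA1 digests both match the stored targets. Returns 0 on success. */
    int rehydrate_block(unsigned char *buf, size_t len,
                        const unsigned char md5_target[MD5_DIGEST_LENGTH],
                        const unsigned char sha1_target[SHA_DIGEST_LENGTH])
    {
        unsigned char md5[MD5_DIGEST_LENGTH], sha1[SHA_DIGEST_LENGTH];

        memset(buf, 0, len);                    /* start from all zeroes  */
        do {
            MD5(buf, len, md5);                 /* cheap filter first     */
            if (memcmp(md5, md5_target, sizeof md5) == 0) {
                SHA1(buf, len, sha1);           /* MD5 matched, confirm   */
                if (memcmp(sha1, sha1_target, sizeof sha1) == 0)
                    return 0;                   /* both digests match     */
                /* otherwise an MD5 collision: keep incrementing */
            }
        } while (!increment(buf, len));
        return -1;                              /* key space exhausted    */
    }

In Babel itself, the faster MD4 digest takes the place of MD5 as the first-pass filter.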
There are two limitations to this method of decompression. The first is the quality of the cryptographic hash functions and the number of raw bytes that they can uniquely re-address in hash form. The second is the magnitude of processing capacity required to execute this procedure - effectively brute-forcing binary key-spaces that are as large as the largest file that you wish to decompress. Both of these issues affect the compression ratio, the former in terms of address space and the latter in terms of practical time to re-hydration.
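For a sense of the scale involved (plain arithmetic, not a figure from the original notes): a file of N bytes admits

    256^N = 2^(8N)

distinct contents, so brute-forcing even a one-kilobyte file as a single unit would mean searching a key space of roughly 2^8192 candidates.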
The solution to both problems is to break the raw file up into manageable blocks of raw data and hash the blocks, rather than the entire file.
Enter Babel. The Babel program works with two arbitrary hashes, currently MD4 (for speed) and SHA1 (for accuracy). The two hashes collectively consume 36 bytes. Thus, by altering the raw block buffer size relative to that constant 36-byte storage figure, it is possible to elect the compression ratio arbitrarily*, weighted between size (1:1 at 36-byte blocks) and effort (10:1 at 360-byte blocks).
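As a rough illustration of how the ratio follows from the block size (assuming the stored record per block is exactly the 16-byte MD4 digest plus the 20-byte SHA1 digest, with no additional per-block overhead):

    /* Illustrative only: assumes each block is stored as its 16-byte MD4
       digest plus its 20-byte SHA1 digest (36 bytes), with no additional
       per-block overhead such as headers or indexes. */
    #include <stdio.h>

    int main(void)
    {
        const double record = 16.0 + 20.0;  /* MD4 + SHA1 bytes stored per block */
        const size_t block_sizes[] = { 36, 72, 180, 360 };

        for (size_t i = 0; i < sizeof block_sizes / sizeof block_sizes[0]; i++)
            printf("block %4zu bytes -> compression ratio %4.1f:1\n",
                   block_sizes[i], block_sizes[i] / record);
        return 0;
    }

Larger blocks mean fewer stored records per file and a higher ratio, at the cost of an exponentially larger key space to search per block.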
The source for this program contains details on accelerating the decompression process, information about the naming convention (including the Crimson Hexagon reference), compressed file format specifications and a table showing example compression settings.
As Babel stands, it would be relatively simple to add a distributed-node feature, as the key space is already divided into 65,536 hash buckets. This could be further evaluated for integration with a P2P DHT (Distributed Hash Table) protocol.
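The 65,536 buckets suggest keying on two bytes of a digest; one plausible mapping (an assumption for illustration, not necessarily Babel's actual scheme) would be:

    /* Assumed bucketing scheme for illustration: select a bucket from the
       first two bytes of the block's SHA1 digest, giving 2^16 = 65,536
       buckets that could each be assigned to a different node. */
    static unsigned int hash_bucket(const unsigned char sha1_digest[20])
    {
        return ((unsigned int)sha1_digest[0] << 8) | sha1_digest[1];  /* 0..65535 */
    }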
* In a single-hash model the practical limit is the collision domain of the selected hash. However, the combined collision domain of multiple hashes is not as clearly defined.

Code:

The following code (source, binaries, patches, etc.) has been developed or mirrored for this project:

Links:

The following links have been identified as relevant to this project:

Activity:

This project was initiated on Saturday, the 11th of April 2009. Its last recorded activity stamp is Tuesday, the 12th of May 2009.