Textbooks Can Be Wrong
Mar 12th, 2012 by Alex
I was doing assigned reading from Guide to Computer Forensics and Investigations (Fourth Edition). The text on page 104 covered acquiring compressed disk images as evidence, particularly the importance of ensuring that the copied data is the same as the original.
Popular archiving tools, such as PKZip, WinZip, and WinRAR, use an algorithm referred to as lossless compression.
According to the book, lossless compression is used for forensics acquisition. Lossy compression isn’t, because it alters the original data.
That makes sense.
But then came the following text:
An easy way to test lossless compression is to perform an MD5 or SHA-1 hash on a file before and after it’s compressed. If the compression is done correctly, both versions have the same hash value. If the hashes don’t match, that means something corrupted the compressed file, such as a hardware or software error.
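When I read this, I stopped. I didn’t believe it. A compressed file contains different bytes than the original, so the two could only share a hash if they happened to collide. While hash collisions are certainly possible from a mathematical perspective, they aren’t likely during everyday use. But I had to see, just for my own sanity.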
$ echo "Hello world." > test.txt
$ md5sum test.txt
fa093de5fc603823f08524f9801f0546 test.txt
$ sha1sum test.txt
4177876fcf6806ef65c4c1a1abf464087bfbf337 test.txt
$
$ zip test.zip test.txt
adding: test.txt (stored 0%)
$ md5sum test.zip
76e13462f780d302e0eb0246c6e4d6d4 test.zip
$ sha1sum test.zip
edfc4a3ee2082fef21dae7028d11d031eef65242 test.zip
Just as I suspected, neither algorithm produces the same hash for the archive as for the original file. I wouldn’t expect it to. Even when the file is merely stored within the archive rather than compressed, the archive still carries added metadata, such as the file name and header records.
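To make the metadata point concrete, here is a minimal Python sketch. It is not from the book and it does not reproduce the zip command above byte for byte. It builds a stored-only ZIP archive in memory with the standard zipfile module. The variable names are illustrative, and the sample bytes are assumed to match what echo wrote to test.txt.

```python
import hashlib
import io
import zipfile

# Assumed to be the same bytes that `echo "Hello world."` wrote to test.txt.
original = b"Hello world.\n"

# Build a ZIP archive in memory with the file stored, not compressed.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_STORED) as archive:
    archive.writestr("test.txt", original)
archive_bytes = buffer.getvalue()

# The archive is larger than the file because of headers and the file name.
print("original size:", len(original))
print("archive size: ", len(archive_bytes))

# The payload sits inside the archive unchanged, next to the metadata.
print("payload embedded verbatim:", original in archive_bytes)
print("file name embedded:", b"test.txt" in archive_bytes)

# Different bytes in, different digest out.
same_hash = hashlib.sha1(original).hexdigest() == hashlib.sha1(archive_bytes).hexdigest()
print("hashes match:", same_hash)
```

Because the archive is a different sequence of bytes, its digest differs from the original's even though no compression took place.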
$ rm test.txt
$ unzip test.zip
Archive: test.zip
extracting: test.txt
$ md5sum test.txt
fa093de5fc603823f08524f9801f0546 test.txt
$ sha1sum test.txt
4177876fcf6806ef65c4c1a1abf464087bfbf337 test.txt
As expected, the decompressed file has the same hash as the original.
I suspect this was merely an editing error. Perhaps a non-technical editor simplified the paragraph and none of the proofreaders caught it. I hope that the authors don’t really believe the sentences quoted above. I hope that digital forensic evidence isn’t being tossed out because the hash of a compressed file doesn’t match the hash of its source.
I think the text intended to say that the hash of the decompressed file should match the hash of the original file, presuming that a lossless algorithm was used. But that’s definitely not how it reads, and it could be critically misleading to someone without previous knowledge.
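The comparison I believe the book meant can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a forensic procedure. It uses zlib (DEFLATE) from the standard library as a stand-in for a zip tool, and the helper names sha1_hex and roundtrip_is_lossless are hypothetical, made up for this example.

```python
import hashlib
import zlib


def sha1_hex(data: bytes) -> str:
    """Return the SHA-1 digest of data as a hex string (hypothetical helper)."""
    return hashlib.sha1(data).hexdigest()


def roundtrip_is_lossless(original: bytes) -> bool:
    """Compress, decompress, and compare the restored hash to the original's."""
    compressed = zlib.compress(original)    # zlib stands in for the archiver
    restored = zlib.decompress(compressed)
    return sha1_hex(restored) == sha1_hex(original)


# Sample data, repeated so the compressor has something to shrink.
sample = b"Hello world.\n" * 100
compressed_sample = zlib.compress(sample)

# The meaningful check: original versus decompressed.
print("original == decompressed:", roundtrip_is_lossless(sample))

# The check the book describes: original versus compressed.
print("original == compressed:  ", sha1_hex(sample) == sha1_hex(compressed_sample))
```

Only the first comparison says anything about whether the compression was lossless. The second compares two different byte streams, so a mismatch is the normal result.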
Textbooks are wrong more often than authors, editors and publishers would like you to realize. Sorry about the whole bubble-bursting thingy.
Others have noted errors in this textbook. They don’t mention this particular error, but they allude to others:
http://www.computerforensicsworld.com/modules.php?name=Forums&file=viewtopic&t=689