Checksums
A checksum is the value returned by a one-way hash algorithm. It can be used to validate the integrity of data because modifying the data in any way will almost certainly change the value returned. For this reason, checksums are sometimes referred to as “fingerprints”.
Checksums are often posted along with downloadable files. After downloading, a user can run the same algorithm on the file and compare the result with the posted checksum to confirm that the file is identical to the one posted. A file altered by transmission errors or malicious intent will produce a different checksum.
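The download check described above can be sketched in a few lines of PHP. This is a minimal illustration, not a definitive implementation: verifyDownload() is a hypothetical helper name, SHA-256 is an assumed choice of algorithm, and a temporary file stands in for a real download.

```php
<?php
// Hypothetical helper: compares a file's SHA-256 digest with a published one.
function verifyDownload(string $path, string $expectedHex): bool
{
    $actualHex = hash_file('sha256', $path);
    if ($actualHex === false) {
        return false; // file missing or unreadable
    }
    // Published digests are sometimes uppercase; normalize before comparing.
    return hash_equals(strtolower(trim($expectedHex)), $actualHex);
}

// Demo with a temporary file standing in for a real download.
$path = tempnam(sys_get_temp_dir(), 'dl_');
file_put_contents($path, "release contents\n");
$published = hash('sha256', "release contents\n");

var_dump(verifyDownload($path, $published)); // bool(true)

// Any change to the file changes its digest, so the check now fails.
file_put_contents($path, "release contents, modified\n");
var_dump(verifyDownload($path, $published)); // bool(false)

unlink($path);
```

hash_file() reads the file in chunks, so this approach should also work for files larger than available memory.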
A checksum can also be used automatically during online communication. The sender sends the checksum of the data, then sends the data. The receiver recomputes the checksum from the data it received and compares it with the checksum that was sent; if the two match, the data arrived complete and unaltered.
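The sender and receiver roles can be simulated in a single script. The sketch below assumes a made-up framing (64 hexadecimal characters of SHA-256 followed by the payload); frameMessage() and readMessage() are hypothetical names, and real protocols define their own layouts. Note that an unkeyed checksum sent over the same channel only guards against accidental corruption, since anyone able to rewrite the data could rewrite the checksum too.

```php
<?php
// Hypothetical framing: 64 hex characters of SHA-256, then the payload.
const DIGEST_HEX_LENGTH = 64;

// Sender side: prefix the payload with its checksum.
function frameMessage(string $data): string
{
    return hash('sha256', $data) . $data;
}

// Receiver side: returns the payload, or null if the checksum does not match.
function readMessage(string $frame): ?string
{
    if (strlen($frame) < DIGEST_HEX_LENGTH) {
        return null;
    }
    $sentDigest = substr($frame, 0, DIGEST_HEX_LENGTH);
    $data = substr($frame, DIGEST_HEX_LENGTH);

    return hash_equals($sentDigest, hash('sha256', $data)) ? $data : null;
}

$frame = frameMessage('Transfer 100 credits');
var_dump(readMessage($frame)); // string(20) "Transfer 100 credits"

// Simulate corruption in transit by changing one byte of the payload.
$corrupted = $frame;
$corrupted[DIGEST_HEX_LENGTH] = 'X';
var_dump(readMessage($corrupted)); // NULL
```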
An algorithm suitable for checksums needs to be fast, widely available, and require no keys. If you only need to detect accidental corruption (transmission errors, disk bit-flips), a non-cryptographic algorithm like crc32() is sufficient. If the checksum must also detect deliberate tampering, use a cryptographic hash such as SHA-256 via PHP’s hash() function. md5() and sha1() are still available in PHP but should no longer be used where collision resistance matters: practical collisions for MD5 were demonstrated in 2004, and for SHA-1 in 2017 (the SHAttered attack). Note that password-hashing functions like bcrypt are not suitable as checksums either: bcrypt uses a random salt (so the same input produces a different output each call) and silently truncates its input at 72 bytes.
<?php
$string = "Give me a checksum.";
// Non-cryptographic; fine for accidental-corruption detection.
echo crc32($string), PHP_EOL;
// 3703541059
// Cryptographic; recommended when tamper detection matters.
echo hash('sha256', $string), PHP_EOL;
// 6a0559a7ac0da7318e93cdfc1affd1f1e768c2b80e597d579d9e5832d8ed11a0
// Legacy; no longer recommended where collision resistance matters.
echo md5($string), PHP_EOL;
// cfa5d275b53523cc6b393b4b76da2da7
echo sha1($string), PHP_EOL;
// b813c8d640644c451d3a45b628a6ebbe60fbb9ba
?>
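The two bcrypt behaviors mentioned above (a random salt and truncation at 72 bytes) can be observed directly with PHP's password_hash() and password_verify(). This is only an illustrative sketch; the cost of 4 is an assumption chosen to keep the demo fast and is far too low for real password storage.

```php
<?php
// Cost 4 is the minimum bcrypt cost; used here only to keep the demo fast.
$options = ['cost' => 4];

$first  = password_hash('Give me a checksum.', PASSWORD_BCRYPT, $options);
$second = password_hash('Give me a checksum.', PASSWORD_BCRYPT, $options);

// A fresh random salt on each call means the two outputs differ,
// so they cannot be compared like checksums.
var_dump($first === $second); // bool(false)

// Verification still works because the salt is stored in the hash string.
var_dump(password_verify('Give me a checksum.', $first)); // bool(true)

// Inputs sharing their first 72 bytes are indistinguishable to bcrypt,
// so a change after that point would go undetected.
$prefix = str_repeat('a', 72);
$hash = password_hash($prefix . 'ORIGINAL', PASSWORD_BCRYPT, $options);
var_dump(password_verify($prefix . 'TAMPERED', $hash)); // bool(true)
```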
Collisions
A collision is when two different pieces of data have the same checksum. This is unavoidable when distilling arbitrarily large data down to a short, fixed-length string: there are far more possible inputs than possible checksums, so some inputs must share a checksum. For example, MD5 returns 32 hexadecimal characters (128 bits) whether the input is a short string or a 1 GB file. In general, the shorter the returned hash, the fewer distinct values are available, and therefore the more likely collisions become.
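Both points can be demonstrated with a short experiment. The sketch below first confirms that MD5 output length does not depend on input size, then deliberately shortens the hash so that a collision is easy to find. findTruncatedCollision() and the "input-N" strings are hypothetical, and truncating MD5 is done here purely for illustration.

```php
<?php
// MD5 output length does not depend on input size.
echo strlen(md5('a')), PHP_EOL;                      // 32
echo strlen(md5(str_repeat('a', 1000000))), PHP_EOL; // 32

// Hypothetical helper: finds two distinct inputs whose MD5 hashes share
// the same first $hexChars characters.
function findTruncatedCollision(int $hexChars): array
{
    $seen = [];
    for ($i = 0; ; $i++) {
        $input = "input-$i";
        $short = substr(md5($input), 0, $hexChars);
        if (isset($seen[$short])) {
            return [$seen[$short], $input, $short];
        }
        $seen[$short] = $input;
    }
}

// Two hex characters allow only 256 distinct values, so a collision is
// guaranteed within the first 257 inputs.
[$a, $b, $short] = findTruncatedCollision(2);
echo "$a and $b both map to $short", PHP_EOL;
```

Raising the argument to 4 or 6 characters makes the search take noticeably longer, which is the same trade-off described above in miniature.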
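Accidental collisions only become a realistic concern for large data sets or short checksums. When comparing two files (an original file and a modified file), it is highly unlikely that they will generate the same hash. However, when calculating checksums for millions of files, the number of pairs that could collide grows roughly with the square of the number of files, so it becomes much more likely that some two of those files will generate the same checksum even though their contents differ. With a 32-bit checksum such as CRC32, a collision is more likely than not after roughly 77,000 files; with a hash of 128 bits or more, accidental collisions remain extremely unlikely even across millions of files.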
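The effect of data-set size can be estimated with the standard birthday-bound approximation, p ≈ 1 − exp(−n(n−1) / (2·2^bits)). The sketch below assumes the checksum behaves like a uniformly random function, which real algorithms only approximate; collisionProbability() is a hypothetical helper name.

```php
<?php
// Approximate probability that at least two of $items inputs share a
// checksum of $bits bits, assuming uniformly distributed outputs.
function collisionProbability(float $items, int $bits): float
{
    $space = 2.0 ** $bits;
    $exponent = $items * ($items - 1) / (2 * $space);
    // -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -expm1(-$exponent);
}

// 32-bit checksum (e.g. CRC32): about even odds at roughly 77,000 items.
printf("32-bit,  77,163 items:    %.3f\n", collisionProbability(77163, 32));
// 32-bit checksum with a million items: a collision is all but certain.
printf("32-bit,  1,000,000 items: %.3f\n", collisionProbability(1e6, 32));
// 128-bit hash with a million items: vanishingly small.
printf("128-bit, 1,000,000 items: %.3e\n", collisionProbability(1e6, 128));
```

These figures describe accidental collisions only; deliberately constructed collisions against a broken algorithm are a separate concern.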
This is one reason why hash algorithms with short or weak outputs are considered unsuitable for storing passwords: it becomes too likely that more than one password will yield the same hash. Imagine that an attacker is trying millions of passwords in an attempt to match a user’s stored password hash. Collisions mean that the attacker doesn’t have to guess the correct password; any other input which yields the same hash would be accepted as valid.
Checksums in Git
The Git version control system uses SHA-1 checksums to identify every commit. In fact, the checksum is used as the commit identifier and is commonly referred to as “the SHA”. Git’s checksums cover metadata about the commit, including the author, date, and the parent commit’s SHA.
Git assures the integrity of the data being stored by using checksums as identifiers. If someone were to try to alter a commit or its metadata, it would change the SHA used to identify it. It would become a different commit.
Git also ensures that the historical chain of commits cannot be edited without detection, because each SHA covers the SHA of the parent commit which precedes it. Altering one commit deep in the history would create a waterfall effect: every descendant commit would have to be recreated with a new SHA as well. The history would become a different history.
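The waterfall effect can be illustrated with a toy model. The sketch below is a deliberate simplification and does not reproduce Git's real object format: commitId() and buildHistory() are hypothetical helpers that hash only a parent identifier, an author, and a message.

```php
<?php
// Simplified model of a commit identifier; NOT Git's real object format.
function commitId(?string $parentId, string $author, string $message): string
{
    $body = 'parent ' . ($parentId ?? 'none') . "\n"
          . 'author ' . $author . "\n\n"
          . $message;

    return sha1($body);
}

// Builds a linear history and returns the identifier of each commit in order.
function buildHistory(array $messages, string $author): array
{
    $ids = [];
    $parent = null;
    foreach ($messages as $message) {
        $parent = commitId($parent, $author, $message);
        $ids[] = $parent;
    }

    return $ids;
}

$original = buildHistory(['Initial commit', 'Add feature', 'Fix bug'], 'Alice');
// Same history, but with the second commit's message altered.
$tampered = buildHistory(['Initial commit', 'Add feature!', 'Fix bug'], 'Alice');

foreach ($original as $i => $id) {
    echo $i, ': ', $id === $tampered[$i] ? 'same' : 'changed', PHP_EOL;
}
// 0: same
// 1: changed
// 2: changed
```

The third commit's own message is untouched, yet its identifier changes because it covers the identifier of its altered parent.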
SHA-1’s collision resistance was practically broken by the SHAttered attack in 2017. Git mitigates this by using a hardened SHA-1 implementation with built-in collision detection (SHA-1DC) by default since Git 2.13, which rejects objects that show evidence of a known collision attack. Since Git 2.29 (2020), Git also supports SHA-256 as an alternative object format (git init --object-format=sha256), though interoperability with SHA-1 repositories is still maturing.