A ‘hash’ is a mathematical function that takes digital data of an arbitrary size and reduces it to a fixed size. For example, gigabytes of data can be reduced to 16 bytes using one popular hash called MD5 (which is now considered unsafe for security purposes but is still widely seen). Another popular hash is called SHA-256, and there are many others.
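As a small illustration of the fixed output size, here is a sketch using Python’s standard hashlib module. The helper name digest_sizes is made up for this example; it simply reports how many bytes MD5 and SHA-256 produce for a given input.

```python
import hashlib


def digest_sizes(data: bytes) -> tuple[int, int]:
    """Return the output sizes, in bytes, of MD5 and SHA-256 for this input."""
    md5_digest = hashlib.md5(data).digest()
    sha256_digest = hashlib.sha256(data).digest()
    return len(md5_digest), len(sha256_digest)


small = b"hello"
large = b"x" * 1_000_000  # about one megabyte of data

# The input sizes differ enormously, but the output sizes do not.
print(len(small), digest_sizes(small))
print(len(large), digest_sizes(large))
```

Whatever the input length, MD5 yields 16 bytes and SHA-256 yields 32 bytes.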
A key feature of such a function is that it is a one-way function, meaning that the output cannot feasibly be turned back into the input. By design, such functions are ‘lossy’: they discard some of the source data in order to reduce the input to a small, fixed-size output. The original data itself is untouched, but the information needed to rebuild it is simply not present in the hash, so the process cannot be run in reverse.
Another key feature is that given the same input, the output will always be the same. This allows different people and systems to use a common hash function and produce the same output, which in turn lets them verify a hash independently of one another.
Lastly, another key feature is that ANY change to the input data, even a single bit, should change the output. This means that a hash can be used to verify that the contents of a file have not changed.
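The last two features can be seen in a short sketch, again assuming Python’s hashlib and SHA-256. The helper sha256_hex and the sample sentences are invented for illustration.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Hash a string with SHA-256 and return the result as hex text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


first = sha256_hex("The quick brown fox")
again = sha256_hex("The quick brown fox")    # identical input
changed = sha256_hex("The quick brown fax")  # one letter changed

print(first == again)    # same input gives the same hash
print(first == changed)  # a tiny change gives a different hash
```

Anyone running the same function on the same text gets the same 64-character result, while the one-letter change produces a hash that no longer matches.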
An extremely simple example of what is happening can be shown like this:
Input: (1, 1, 3)
Output: 5
Given the 3 input numbers (1, 1, 3), and using a hash function that simply adds the numbers together, we get 5 as the output. Even knowing that our hash function simply adds numbers together, one cannot be given 5 as the output and know that (1, 1, 3) was the input. With some time, one might be able to start creating a list of possible inputs (inputs that produce a given output are called preimages), but it would soon become apparent that the possibilities are infinite once it is realized that the range and quantity of inputs isn’t limited.
The input (-3, 2, -1, 7) would also produce 5 as the output, for example, and that leads us to the problem with such a simple hash function: collisions. Collisions occur when two different inputs produce the same output. To be useful, a hash function must do two things.
1. Reduce input data of any size down to a small consistent output size.
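2. Produce a different output for every different input, at least in practice. Because the output size is fixed, collisions must exist in theory, but a good hash function makes them practically impossible to find.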
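The addition example above can be written out as a toy function to make the collision concrete. This is only a sketch of the simplified example, not a real hash function, and the name additive_hash is made up here.

```python
def additive_hash(numbers):
    """Toy 'hash' from the example: add all of the inputs together."""
    return sum(numbers)


a = (1, 1, 3)
b = (-3, 2, -1, 7)

# Two different inputs, one shared output: a collision.
print(additive_hash(a), additive_hash(b))
print(a != b and additive_hash(a) == additive_hash(b))
```

Both inputs reduce to 5, which is why simple addition meets the first requirement but fails the second.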
One common way a hash is used is to verify a file. Someone writes a program and shares it with the world along with a hash for that file. People all over the world download the file and generate their own hash of the file using their own computer. If the hash they generate matches the one provided on the website, then they can be sure that the file they downloaded is exactly the same as when the author shared it. If even a single bit of information is added, removed, or changed by a hacker, then the hash will not match.
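A download check like this might look roughly as follows in Python. This is a sketch that assumes the author published a SHA-256 hash as hex text; the function names file_sha256 and matches_published_hash are hypothetical.

```python
import hashlib


def file_sha256(path, chunk_size=65536):
    """Hash a file in chunks so large downloads need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, published_hex):
    """Compare our own hash of the file with the one the author published."""
    return file_sha256(path) == published_hex.strip().lower()
```

The person downloading the file would call matches_published_hash with the path of the download and the hash copied from the website. The check is only as trustworthy as the place the published hash came from.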
The same can be applied to documents like contracts. One person can draw up a contract, sign it, and then generate a hash of that file. Another party can review the contract and generate their own hash and if it matches they can be sure that they got the same contract that the author created. That party can sign the contract, and then generate a new hash that can be shared with the original author who can then verify that new hash. In this way, two people can be sure that any document they exchange back and forth online has not been altered in any way. Years later, the file can still be verified with the original hash.
Another common way a hash is used is to verify passwords. Systems can store your password in plain text, but that is a very poor security practice these days. Instead, systems can use hashes to store passwords. If a system stores your password in plain text, then any attacker who gets into the system can see a full list of passwords with minimal effort. By using hashes to store passwords, the attacker may not get the passwords at all, or at least it will be much harder and require time and technical skill to figure them out.
(In reality, developers should be using purpose-built password-hashing methods, like bcrypt, that are based on hashes but much better suited to passwords, but we’ll keep it simple for this example.)
A system that uses hashes for passwords would collect your password when you create your account, create a hash of that password, and store the resulting hash in their database. Then later when you log in again, they would create a new hash of what you typed in, compare that to the hash they have on file, and if they match, let you in. Doing this helps protect your password from attackers and bad employees.
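The sign-up and log-in flow described above could be sketched like this. Note that this example swaps the plain hash for PBKDF2 with a random salt, both available in Python’s standard library, since a bare hash is not recommended for passwords. The function names and the iteration count are assumptions chosen for illustration, not a recommendation.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # assumed work factor, for illustration only


def hash_password(password, salt=None):
    """Return (salt, digest) to store in the database instead of the password."""
    if salt is None:
        salt = os.urandom(16)  # a fresh random salt for each account
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return salt, digest


def verify_password(password, salt, stored_digest):
    """Re-hash what the user typed and compare it with the stored digest."""
    _, candidate = hash_password(password, salt)
    return hmac.compare_digest(candidate, stored_digest)
```

At account creation the system would store the salt and digest returned by hash_password. At log-in it would call verify_password with what the user typed and let them in only if it returns True.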
Hashes are very useful and are quite commonly used in the digital world, sometimes just to make a program run faster or to keep the size of a database small, and other times to protect and verify important information.