SHA 1-2-3

Published: 2012-04-02
Last Updated: 2012-04-02 16:50:25 UTC
by Johannes Ullrich (Version: 1)
9 comment(s)

NIST is currently in the final stages of defining a new "secure hashing algorithm" (SHA) [1]. The goal of the competition is to find a replacement for the current standard ("SHA-2", aka SHA-256 and SHA-512). NIST tries to be proactive in defining crypto standards, realizing that new standards need to be implemented well before the old ones are considered broken.

The three popular hash functions, MD5, SHA1 and SHA2, all use variations of a particular construction known as the "Merkle-Damgård construction". Attacks have been developed for MD5 and SHA1, and it is plausible that they will be extended to SHA2 in the future. As a result, the candidates for SHA-3 use different constructions that are hopefully safe from similar attacks.

A good cryptographic hash makes it hard to come up with two different documents that exhibit the same hash. There are a number of variations of this attack, depending, for example, on whether one of the documents (or hashes) is fixed in advance. One attack in particular affects Merkle-Damgård based hashes: the "length extension attack". This attack is particularly relevant if a hash is used to verify the integrity of a document.

In the vulnerable scheme, the hash is created by concatenating a secret and a file, then hashing the result. The hash and the file (without the secret) are then transmitted to a recipient. The recipient uses the same secret to recreate the hash and verify that the file is authentic. However, for hashes susceptible to the length extension attack, an attacker may calculate a valid new hash, provided the attacker knows the size of the original hashed document. The attacker can only append to the document, not modify it.

To make this more specific, let's say I am sending you a contract. We agreed on a pre-shared secret. I create the hash of "secret+contract" and send the hash to you with the document. An attacker now intercepts the message, adds a page to the contract, and calculates a new hash. All the attacker needs to know (or guess) is the length of the secret.
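As a sketch of the vulnerable pattern (the secret and contract below are made up for illustration), this is what hashing "secret+contract" looks like in Python:

```python
import hashlib

SECRET = b"pre-shared-secret"  # hypothetical shared secret

def naive_tag(document: bytes) -> str:
    # Vulnerable pattern: hash(secret || document). With a
    # Merkle-Damgard hash such as SHA-256, an attacker who knows
    # only len(SECRET) can append data and compute a valid tag
    # for the extended document (length extension).
    return hashlib.sha256(SECRET + document).hexdigest()

def naive_verify(document: bytes, tag: str) -> bool:
    return naive_tag(document) == tag

contract = b"Party A agrees to pay Party B $100."
tag = naive_tag(contract)
```

The length-extension attack itself is not shown: it requires reconstructing the hash's internal state from the tag, which libraries like hashlib deliberately do not expose.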

Current hashing functions can be used safely to authenticate messages, but the algorithm has to be slightly more complex. Instead of just prepending the key, an HMAC construction has to be used, which applies the key by XOR'ing it with fixed padding constants and hashing twice (just use HMAC... it's a bit more involved than that and has to be done right).
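With Python's standard library this is a one-liner (the key name here is illustrative, matching the sketch above):

```python
import hashlib
import hmac

SECRET = b"pre-shared-secret"  # hypothetical shared secret

def sign(document: bytes) -> str:
    # HMAC-SHA256: hash(key XOR opad || hash(key XOR ipad || document)).
    # The two-pass structure is what defeats length extension.
    return hmac.new(SECRET, document, hashlib.sha256).hexdigest()

def verify(document: bytes, tag: str) -> bool:
    # Constant-time comparison avoids timing side channels.
    return hmac.compare_digest(sign(document), tag)
```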

But back to SHA-3: NIST has narrowed the field of potential candidates down to 5. All of them are safe with respect to the length-extension attack. However, they all take more CPU cycles than SHA-2, except in some cases where HMAC is required. In addition, at a recent IETF meeting it was pointed out that during the competition SHA-2 turned out to be more robust than expected, reducing the need for SHA-3 to replace SHA-2.

So why should you care? There are a number of reasons why you should be concerned about encryption standards. First of all, many compliance regimes require the use of specific hashing and encryption algorithms. Secondly, while there may be equivalently strong algorithms, developers of libraries usually spend more effort optimizing standard algorithms, and you may even find them "in silicon" in modern CPUs. I wouldn't be surprised to find a "SHA-3" opcode in a future mainstream CPU. At this point, SHA-256 or SHA-512 should be used if you are developing new software. However, if you find SHA-1, you shouldn't panic. Make sure you are using HMAC properly, and not just concatenating secrets with documents in order to validate them.


Johannes B. Ullrich, Ph.D.
SANS Technology Institute



Meh. SHA1(gzip(data)) + MD5(gzip(data)) is nowhere close to being broken. Even SHA1(data) + MD5(data) will hold for a long time to come.
but that will require calculating two hashes instead of one. If you can calculate one not-so-broken hash in less time than two broken hashes, the not-so-broken hash wins imho.
Joshua, please tell me you're not a software developer. It's this flippant regard for cryptography that lends itself to hundreds of security breaches every day through poorly designed software. We have standards for a reason; the crypto gods have set them because they have done the cryptographic analysis.

Firstly, your statement raises more questions than it answers:
When you say '+', do you mean addition or concatenation? If it's addition, is it little-endian or big-endian? Signed or unsigned? How do we deal with the different lengths of the SHA1/MD5 outputs? If concatenation, then why do you use MD5 at all, given that concatenated hash functions are only as secure as their strongest component? Given that both MD5 and SHA1 are considered broken with respect to collision resistance, how does the concatenation step offer any additional security in this attack scenario?
What properties do you need from your hash? Do you need second pre-image resistance or just pre-image resistance? Collision resistance? What application are you using it in? Is it for use in an HMAC? Or salted password storage? Or a message integrity code? A digital signature? A zero-knowledge protocol?
Why do you gzip the data first? What about bzip2? lzma? Can I choose which compression algorithm I use or does it have to be gzip? Is the compression supposed to add any level of security? If so, how, given that the algorithm is public knowledge? If not, why do you do it? It seems to add far more computational complexity with no benefit, given that the hash outputs are fixed size.
What do you mean by 'nowhere close to being broken'? Do you mean that it is computationally secure? Against what level of adversary? Or did you mean information theoretic secure? Can you prove it?
Is there a reason you chose a simple addition/concatenation over some more complex operation such as SHA1(MD5(data)) or MD5(SHA1(data) + data)?

Secondly, you should never *ever* treat cryptographic constructs as black box operations. As a toy example, consider what happens when you use a Caesar cipher twice. It's the same as just using one Caesar cipher with a different key. Now consider what happens if you use two Caesar ciphers, with one cipher using a key of +3 and the other using a key of -3. Treating a crypto construct as a black box operation is convenient for a programmer, but you need to look at the entire information path as a whole (by expanding the constructs as white boxes), rather than looking at each operation in isolation.
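The toy example can be checked directly with a minimal lowercase-only Caesar cipher (a sketch, not a real cipher implementation):

```python
def caesar(text: str, shift: int) -> str:
    # Shift lowercase letters by `shift` positions; leave others alone.
    return "".join(
        chr((ord(c) - ord("a") + shift) % 26 + ord("a")) if c.islower() else c
        for c in text
    )

message = "attack at dawn"
# Two Caesar ciphers compose into a single one with the summed key...
assert caesar(caesar(message, 3), 5) == caesar(message, 8)
# ...and keys +3 and -3 cancel each other out entirely.
assert caesar(caesar(message, 3), -3) == message
```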

Basically, what you've done with your SHA + MD5 construct is you've just invented your own hash function, but you haven't even considered whether this hash function you've invented is secure, much less proven it. Have some humility; you're not knowledgeable in cryptography (at least, not enough to develop your own crypto) so just follow the damn standards. If you want a hash now, just use SHA2. If you want a hash in 5 years time, just use SHA3. NEVER roll your own.
Dear Mike,

Joshua is obviously a developer, and so, like me, he's probably too lazy to numerically add two hashes. That means the + sign denotes concatenation.

So, Johannes is right: concatenating two semi-broken hashes is better than using only one of them, but it's better still to just use a stronger hashing algorithm.

And you're right as well: NEVER roll your own. If Joshua had numerically added the two, that might qualify as rolling your own. But concatenating them does not. It's like putting a second lock on your door.

@Mike et al.: I've also wondered why @Joshua's suggestion isn't a practical way to reduce the chance of forgery for non-ephemeral data.

"+" doesn't necessarily mean addition or concatenation or some other specific operation; it means having two redundant hashes computed using two separate algorithms. Presumably those algorithms will have different weaknesses or failure modes, and having multiple hashes reduces the likelihood that a newly-found weakness in one hash algorithm will allow a forgery to pass undetected because the document was hashed using only that algorithm. The likelihood that one forged plaintext would generate the same hashes as the original plaintext under _both_ hash algorithms is very small, correct?

Note that I am not suggesting this is better than a new hash algorithm with no known weaknesses. I am suggesting it's a way to improve future trustworthiness of the validation of long-term data signed using current hashes in the face of continuing attacks. No hash algorithm is guaranteed to be free from weaknesses, so design in graceful failure in the face of the inevitable flaw.

Using two hashes to validate network traffic is, as suggested, a poor idea compared to using a newer, more robust single hash. But how do you apply that newer, more robust single hash to a legal document that was cryptographically signed _ten years ago_?

I wish that PGP, GPG, et al. had used more than one hash algorithm for signing _files_ from day one, and I don't understand why that is a bad idea. Is it actually _easier_ to forge a document if two hashes using different algorithms are used, and one has a known failure mode?
"@Mike et. al.: I've also wondered why @Joshua's suggestion isn't a practical way to reduce the chance of forgery for non-ephemeral data."

You're right, it has great potential to reduce the chance of forgery; but the chance of forgery is still minuscule with just SHA1, mounting an attack is expensive, and most people won't be targets. To compromise your 288-bit hash, you would need a simultaneous collision of both SHA1 and MD5; that is a LOT harder to find than a collision of either... how much harder mathematically? I don't have a clue. The attacks on the two are significantly different. Is your SHA+MD5 288-bit hash as strong as a SHA-256 hash? There's no rigorous information showing this. Is your SHA+MD5 hash as strong as a SHA-512 hash? Most certainly not.

What we do know is that SHA+MD5 requires more bits of storage than a SHA-256 hash, and you will have to build and maintain custom code for this hash. It requires more CPU time to compute and, if transmitting, more network traffic.
The hash is not standardized for use with existing network protocols or authentication algorithms; you can't go to a CA and request an X.509 certificate with "SHA(gzip(key))+MD5(gzip(key))" as the hash algorithm.

There are efficiencies lost by using non-standardized combinations of other algorithms.

Ultimately you would need to incur great expense to research the question about whether SHA1(data)+MD5(data) is robustly safe against all known attacks -- it would likely take millions of dollars in research funding to try to break SHA1+MD5. If your adversaries don't have that kind of money, and you don't mind using a slightly non-standard hash, then it seems like a good interim solution. If you are a government, you will want the extra bits to rigorously add security; because you have to worry about foreign governments who can fund their own crypto research and break your algorithms in secret.

With SHA1(data) + MD5(data), the result is a hash with more bits; you now have a 288-bit hash. We CAN rigorously say this is no less secure than the stronger of SHA1(data) and MD5(data); we can also say that for SHA1(data)+MD5(data) to be subject to attack, there must be inputs whose findable SHA1 collisions are also MD5 collisions... the attacks against MD5 and SHA1 are different, so overlap seems unlikely.

As for [+]: it doesn't have to be concatenation; you can use any combination of the two hashes, and none of the bits in either the MD5 or the SHA1 hash are destroyed, PROVIDED the definition you use for the "+(a,b)" operation is any function f(a,b) that is 1:1 and onto with a known inverse with respect to both inputs. That is, you must be able to define two functions g and h such that g(f(a,b)) = a and h(f(a,b)) = b.

Concatenation is such a function: given that 'a' is a SHA1 hash and 'b' is an MD5 hash, and because all SHA1 hashes are 160 bits and all MD5 hashes are 128 bits, the concatenation has a fixed length. If you have
f(a,b) = concatenate(a,b)
you have 288 bits. You can define g as the function that yields the first 160 bits of its input, and h as the function that yields the last 128 bits of its input.
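In Python, where hashlib makes the digest lengths explicit, that concatenation and its two inverses look like this (function names f, g, h follow the comment above):

```python
import hashlib

def f(data: bytes) -> bytes:
    # "+" as concatenation: 160-bit SHA1 followed by 128-bit MD5.
    return hashlib.sha1(data).digest() + hashlib.md5(data).digest()

def g(combined: bytes) -> bytes:
    return combined[:20]   # first 160 bits: the SHA1 half

def h(combined: bytes) -> bytes:
    return combined[20:]   # last 128 bits: the MD5 half
```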

As long as you preserve all the hash bits, it doesn't matter in the least which combining operation you use. But there are some definitions of "+" that destroy data; for example: pad the MD5 hash to 160 bits, XOR SHA1(data) and MD5(data) together, and yield the 160-bit result of XORing the two hashes.

The result is 160 bits which might be easier to break than either SHA1 or MD5. But if you make sure never to 'throw away' hash output bits, the result is stronger.

The increase in strength just might not be efficient.
@Rob: "If Joshua would have numerically added the two, then that might qualify as rolling your own. But concatenating them does not."
Yes it does! It's now a 288-bit hash function. As I said before, you CANNOT consider the individual hash functions as black boxes; the information in one hash is related to the information in the other, because they have the same input. The functions themselves are even architecturally similar! By giving an attacker room to exploit correlations between the data and/or architectures, you could end up decreasing security without even knowing it. Worse still, you actually think you've *increased* security (and will no doubt assure your users of that)!

Thanks also for helpfully explaining to me that hash functions are analogous to door locks, I wasn't aware that we in the security community could trivially dismiss the complexities of cryptographic functions for the sake of convenience. The analogy fails though because both locks are considered 'broken'. Sure it's going to take an attacker more time, but if my locksmith told me 'well this lock can be easily picked, so you should probably put a few of them on your door just to be safe', I'd be looking for a new locksmith (with better locks).

@Mysid: "You're right it has a great potential to reduce the chance of forgery; the chance of forgery is still miniscule with just SHA1, mounting an attack is expensive, and most people won't be targets. To compromise your 288 bit hash, you would need a simultaneous collision of both SHA1 and MD5; that is a LOT harder to find than a collision of either... how much harder mathematically?"

Let's set our hash function h(A) = SHA1(A) || MD5(A)
A brute force birthday attack against a 288-bit hash will take us time complexity 2^144.
A simple way we could improve this is by repeatedly finding collisions in MD5 (time complexity 2^21 each), checking for each pair whether it also produces a collision in SHA1 (a birthday search: 2^80). The result is that we can reliably find collisions in h() in time complexity 2^101, which is significantly less than a brute force birthday attack against the hash as a whole (2^144). In effect, we've just reduced the cryptographic strength of the hash function to that of a 202-bit ideal hash.

It gets *much* better though. By a stroke of luck, Merkle-Damgard hash functions (to which MD5 and SHA-1 both belong) have an interesting property: once you already have a collision, finding more collisions is trivial. As such, we just need to find *one* collision in MD5, apply the multi-collision method, and we have as many MD5 collisions as we want to feed into the SHA1 birthday attack. Using this method, we come to the conclusion that SHA1+MD5 falls in time complexity 2^21 + 2^80 -- note the '+', not '*'. That is, it's only about as good as a 160-bit ideal hash.

It gets *even better*!! Because SHA-1 also has a collision attack (2^51) that beats its birthday attack (2^80), and it's also of the Merkle-Damgard architecture, we can reverse the method: first attack SHA-1 (2^51), then use the multi-collisions as input to a birthday attack against MD5 (2^64). Because MD5 is smaller, this yields an attack time complexity of only about 2^64. We've now reduced the strength of SHA1+MD5 from 288 bits down to the strength of an ideal 128-bit hash.
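The arithmetic behind these figures can be sanity-checked directly (the exponents are this comment's own estimates, not independently verified attack costs):

```python
from math import log2

# Brute-force birthday bound on an ideal 288-bit hash:
assert log2(2**144) == 288 / 2

# One MD5 collision (2^21) per SHA1 birthday candidate (2^80):
assert log2(2**21 * 2**80) == 101          # the 2^101 figure

# With multi-collisions the costs add rather than multiply:
assert round(log2(2**21 + 2**80)) == 80    # ~ a 160-bit ideal hash

# Reversed: SHA-1 collision attack, then an MD5 birthday search:
assert round(log2(2**51 + 2**64)) == 64    # ~ a 128-bit ideal hash
```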

We've managed to reduce the complexity of the SHA1+MD5 hash function from a presumed 288-bit ideal hash-function down to only a 128-bit ideal hash function using *only known attacks* in a few minutes. Given the similarity of the SHA1 and MD5 constructions, there's a very good possibility that this can be reduced even further through research. Thankfully, this research isn't called for, because no current standard is stupid enough to suggest we use this construction.

While it is true that SHA1+MD5 is better than SHA1 alone or MD5 alone (there's an easy out if anyone feels the need to save face), it's only trivially better and nowhere near as good as the hash length would imply. I would suggest that if you must use two hash functions together as a way of increasing security in the absence of better hash functions, please seek the advice of a knowledgeable cryptographer on which hash functions to choose and how to combine them.

@Mysid: "Is your SHA+MD5 hash as strong as a SHA-512 hash? Most certainly not."

I'm not suggesting that it is. I'm suggesting only that it appears stronger than either SHA or MD5 alone, _for documents signed at a point in time when better hashes weren't available_. If I were signing a document *today* I'd pick SHA-512 and some other hash algorithm hopefully not sharing similar weaknesses with SHA-512. (I'd also hope that the software offering two-hash verification would exclude weak combinations.)

"The hash is not standardized for use with existing network protocols..."

I recognize that and don't suggest that this is a good idea for network traffic.

Basically I'm lamenting that Phil Zimmermann didn't use two hashes for verification in his original version of PGP (and to a lesser degree that the X.509 cert doesn't allow for this either - but certs are a lot easier to invalidate and replace than a cryptographically signed legal document).

@Mike: Thank you for that discussion, it does help answer my question of "how good an idea is this?"
The point of gzip was to seriously frustrate normal collision hunting: most of the colliding inputs an attack would find are no longer achievable, because the hash input must be a well-formed gzip stream. The output of gzip doesn't contain chunks that can be trivially replaced.

SHA1 + MD5 (and yes + did mean concat) was something I saw published elsewhere, not something I came up with.
