Hash Collision Probability
One of the problems of hash functions is collisions, they cause some security vulnerabilities which won’t be discussed in this article.
In this article I would like to find some approximations concerning collisions.
Probability for a collision
Let \(\mathbf{M}\) be the size of the range (i.e. for md5 it is \(2^{128}\) since md5 returns 128 bits).
Let \(\mathbf{n}\) be the number of random values in this range.
We shall calculate \(p\), the probability for at least one collision.
First, we calculate \(\overline{p}\), the probability for no collisions:
The first value can be any one from \(\mathbf{M}\), the second value can be any one from \(\mathbf{M}-1\) different values, and so on.
Therefore, \(\overline{p}=\frac{\mathbf{M}}{\mathbf{M}}\frac{\mathbf{M}-1}{\mathbf{M}}…\frac{\mathbf{M}-(\mathbf{n}-1)}{\mathbf{M}}=\prod_{k=0}^{\mathbf{n}-1}\frac{\mathbf{M}-k}{\mathbf{M}}=\prod_{k=0}^{\mathbf{n}-1}(1-\frac{k}{\mathbf{M}})\).
Since \(p=1-\overline{p}\) it is clear that:
\[(1) p=1-\prod_{k=0}^{\mathbf{n}-1}(1-\frac{k}{\mathbf{M}})\]
Recall that \(e^x=\sum_{k=0}^{\infty}\frac{x^k}{k!}\approx \frac{x^0}{0!}+\frac{x^1}{1!}=1+x\), substituting \(x=-\frac{k}{\mathbf{M}}\) gives \(1-\frac{k}{\mathbf{M}}\approx e^{-\frac{k}{\mathbf{M}}}\).
Now, from \((1)\):
\(p\approx 1-\prod_{k=0}^{\mathbf{n}-1}e^{-\frac{k}{\mathbf{M}}}=1-e^{\sum_{k=0}^{\mathbf{n}-1}{-\frac{k}{\mathbf{M}}}}=1-e^{-\frac{1}{\mathbf{M}} \sum_{k=0}^{\mathbf{n}-1}k}=1-e^{-\frac{\mathbf{n}(\mathbf{n}-1)}{2\mathbf{M}}}\)
Thus:
\[(2) p\approx 1-e^{-\frac{\mathbf{n}(\mathbf{n}-1)}{2\mathbf{M}}}\]
Under the condition \(1\ll \mathbf{n}^2\ll 2\mathbf{M}\) an even simpler formula emerges.
Again, \(e^x\approx 1+x\), substituting \(x=-\frac{\mathbf{n}^2}{2\mathbf{M}}\) gives \(1–\frac{\mathbf{n}^2}{2\mathbf{M}}\approx e^{-\frac{\mathbf{n}^2}{2\mathbf{M}}}\).
From \((2)\):
\(p\approx 1-e^{-\frac{\mathbf{n}(\mathbf{n}-1)}{2\mathbf{M}}}\approx 1-e^{-\frac{\mathbf{n}^2}{2\mathbf{M}}}\approx 1-(1-\frac{\mathbf{n}^2}{2\mathbf{M}})=\frac{\mathbf{n}^2}{2\mathbf{M}}\)
Therefore:
\[(3) p\approx \frac{\mathbf{n}^2}{2\mathbf{M}}\]
Amount of values for a probable collision
Let \(\mathbf{M}\) be the size of the range (i.e. for md5 it is \(2^{128}\) since md5 returns 128 bits).
Let \(\mathbf{p}\) be the desired probability for at least one collision.
We shall calculate \(n\), the number of values such that the probability for a collision is \(p\).
From \((2)\):
\(\mathbf{p}\approx 1-e^{-\frac{n(n-1)}{2\mathbf{M}}}
\\
1-\mathbf{p}\approx e^{-\frac{n(n-1)}{2\mathbf{M}}}
\\
\log(1-\mathbf{p})\approx-\frac{n(n-1)}{2\mathbf{M}}
\\
n(n-1)\approx-2\mathbf{M}\log(1-\mathbf{p})=2\mathbf{M}\log \frac{1}{1-\mathbf{p}}
\\
n^2-n-2\mathbf{M}\log \frac{1}{1-\mathbf{p}}\approx0
\\
n\approx\frac{1+\sqrt{1-8\mathbf{M}\log \frac{1}{1-\mathbf{p}}}}{2}=\frac{1}{2}+\sqrt{\frac{1}{4}+2\mathbf{M}\log \frac{1}{1-\mathbf{p}}}\)
Hence:
\[(4) n\approx \frac{1}{2}+\sqrt{\frac{1}{4}+2\mathbf{M}\log \frac{1}{1-\mathbf{p}}}\]
If \(\mathbf{M}\mathbf{p}\gg 1\) the constants are negligible:
\[(5) n\approx\sqrt{2\mathbf{M}\log \frac{1}{1-\mathbf{p}}}\]
If in addition \(\mathbf{p}\ll 1\), formula \((3)\) implies (can be demonstrated using \((5)\) too):
\[(6) n\approx \sqrt{2\mathbf{M}\mathbf{p}}\]
Examples
Birthday problem
The birthday problem (often referred as birthday paradox) states that if \(23\) people are randomly selected, the probability that two of them share their birthdays is higher than \(\frac{1}{2}\).
In this case \(\mathbf{M}=365\), \(\mathbf{n}=23\) and \(\mathbf{p}=\frac{1}{2}\).
Using formula\( (1)\):
\(p=1-\prod_{k=0}^{22}(1-\frac{k}{365})=0.507297\)
Luckily, formula \( (2)\) produces a similar result:
\(p\approx 1-e^{-\frac{23\times 22}{2\times 365}}=1-e^{-\frac{253}{365}}=0.500002\)
The inverse can be calculated using formula \( (4)\):
\(n\approx \frac{1}{2}+\sqrt{\frac{1}{4}+2\times 365\log \frac{1}{1-\frac{1}{2}}}=\frac{1}{2}+\sqrt{\frac{1}{4}+730\log 2}=22.9999\)
Formulas \( (3)\), \( (5)\) and \( (6)\) are inappropriate in this case (however their results are not too far from the actual results).
md5 hash function
In this case \(\mathbf{M}=2^{128}\) as I have mentioned in the beginning.
Using formula \((5)\) (\(\mathbf{M}\) is very large) we can find \(n\) given \(\mathbf{p}\):
\(\mathbf{p}=\frac{1}{1000}\): \(n\approx\sqrt{2\times 2^{128} \log \frac{1}{1-\frac{1}{1000}}}=8.25170\times 10^{17}\)
\(\mathbf{p}=\frac{1}{100}\): \(n\approx\sqrt{2\times 2^{128} \log \frac{1}{1-\frac{1}{100}}}=2.61532\times 10^{18}\)
\(\mathbf{p}=\frac{1}{10}\): \(n\approx\sqrt{2\times 2^{128} \log \frac{1}{1-\frac{1}{10}}}=8.4678\times 10^{18}\)
\(\mathbf{p}=\frac{1}{2}\): \(n\approx\sqrt{2\times 2^{128} \log \frac{1}{1-\frac{1}{2}}}=2.17194\times 10^{19}\)
\(\mathbf{p}=\frac{3}{4}\): \(n\approx\sqrt{2\times 2^{128} \log \frac{1}{1-\frac{3}{4}}}=3.07158\times 10^{19}\)
\(\mathbf{p}=\frac{99}{100}\): \(n\approx\sqrt{2\times 2^{128} \log \frac{1}{1-\frac{99}{100}}}=5.59832\times 10^{19}\)
For \(\mathbf{p}=\frac{1}{1000}, \frac{1}{100}, \frac{1}{10}\) formula \((6)\) can be used (higher values give bad results):
\(\mathbf{p}=\frac{1}{1000}\): \(n\approx\sqrt{2\times 2^{128} \frac{1}{1000}}=8.24963\times 10^{17}\)
\(\mathbf{p}=\frac{1}{100}\): \(n\approx\sqrt{2\times 2^{128} \frac{1}{100}}=2.60876\times 10^{18}\)
\(\mathbf{p}=\frac{1}{10}\): \(n\approx\sqrt{2\times 2^{128} \frac{1}{10}}=8.24963\times 10^{18}\)
These results are consistent with the previous results.
Summary
We have obtained some formulas that relate \(\mathbf{M}\), \(\mathbf{n}\) and \(\mathbf{p}\):
\[(1) p=1-\prod_{k=0}^{\mathbf{n}-1}(1-\frac{k}{\mathbf{M}})
\\
(2) p\approx 1-e^{-\frac{\mathbf{n}(\mathbf{n}-1)}{2\mathbf{M}}}
\\
(3) p\approx \frac{\mathbf{n}^2}{2\mathbf{M}} (1\ll \mathbf{n}^2\ll 2\mathbf{M})
\\
(4) n\approx \frac{1}{2}+\sqrt{\frac{1}{4}+2\mathbf{M}\log \frac{1}{1-\mathbf{p}}}
\\
(5) n\approx\sqrt{2\mathbf{M}\log \frac{1}{1-\mathbf{p}}} (\mathbf{M}\mathbf{p}\gg 1)
\\
(6) n\approx \sqrt{2\mathbf{M}\mathbf{p}} (\mathbf{Mp}\gg 1, \mathbf{p}\ll 1)\]
2 Responses to Hash Collision Probability
Leave a Reply Cancel reply
Categories





I felt bad with using a 1st order taylor approximation of e^x.
Graphing with WA: http://www.wolframalpha.com/input/?i=e^x+-+%281+%2B+x%29
What is the relative error? http://www.wolframalpha.com/input/?i=e^x%2F%281+%2B+x%29
But since we’re using -k/M as x, and k <= n, and n < M, we're probably near 0 from the negative side.
Lets ask WA "when is the error acceptable?": http://www.wolframalpha.com/input/?i=0.5+%3C+e^x%2F%281+%2B+x%29+%3C+1.5
It answers "between ~-0.6 and ~1.1". So we seem to be safe.
All in all: Well played! Interesting read.
Thank you for the comment.
Also remember that the relative is the biggest on the last term of our product: 1-n/M.
When M is big enough this number is tiny and the approximation is wonderful.
Usually n is on the order of Sqrt(M) or lower, so the approximation is for |x|