One of the problems with hash functions is collisions; they cause security vulnerabilities which won’t be discussed in this article.
In this article I would like to derive some approximations concerning collisions.

Probability for a collision

Let \(\mathbf{M}\) be the size of the range (e.g. for md5 it is \(2^{128}\), since md5 returns 128 bits).
Let \(\mathbf{n}\) be the number of random values in this range.
We shall calculate \(p\), the probability for at least one collision.

First, we calculate \(\overline{p}\), the probability for no collisions:
For no collision to occur, the first value can be any one of the \(\mathbf{M}\) possible values, the second value can be any one of the remaining \(\mathbf{M}-1\) values, and so on.
Therefore, \(\overline{p}=\frac{\mathbf{M}}{\mathbf{M}}\frac{\mathbf{M}-1}{\mathbf{M}}…\frac{\mathbf{M}-(\mathbf{n}-1)}{\mathbf{M}}=\prod_{k=0}^{\mathbf{n}-1}\frac{\mathbf{M}-k}{\mathbf{M}}=\prod_{k=0}^{\mathbf{n}-1}(1-\frac{k}{\mathbf{M}})\).

Since \(p=1-\overline{p}\) it is clear that:
\[(1) p=1-\prod_{k=0}^{\mathbf{n}-1}(1-\frac{k}{\mathbf{M}})\]
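
As a quick sanity check, formula \((1)\) is easy to evaluate directly. Below is a minimal Python sketch; the function name exact_collision_probability is my own, not part of any library:

    def exact_collision_probability(n, M):
        # Formula (1): probability of at least one collision among
        # n uniformly random values drawn from a range of size M.
        no_collision = 1.0
        for k in range(n):
            no_collision *= 1.0 - k / M
        return 1.0 - no_collision

    print(exact_collision_probability(23, 365))  # ~0.507297, the birthday problem discussed below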

Recall that \(e^x=\sum_{j=0}^{\infty}\frac{x^j}{j!}\approx \frac{x^0}{0!}+\frac{x^1}{1!}=1+x\) for small \(|x|\); substituting \(x=-\frac{k}{\mathbf{M}}\) gives \(1-\frac{k}{\mathbf{M}}\approx e^{-\frac{k}{\mathbf{M}}}\).
Now, from \((1)\):
\(p\approx 1-\prod_{k=0}^{\mathbf{n}-1}e^{-\frac{k}{\mathbf{M}}}=1-e^{\sum_{k=0}^{\mathbf{n}-1}{-\frac{k}{\mathbf{M}}}}=1-e^{-\frac{1}{\mathbf{M}} \sum_{k=0}^{\mathbf{n}-1}k}=1-e^{-\frac{\mathbf{n}(\mathbf{n}-1)}{2\mathbf{M}}}\)
Thus:
\[(2) p\approx 1-e^{-\frac{\mathbf{n}(\mathbf{n}-1)}{2\mathbf{M}}}\]

Under the condition \(1\ll \mathbf{n}^2\ll 2\mathbf{M}\) an even simpler formula emerges.
Again, \(e^x\approx 1+x\); substituting \(x=-\frac{\mathbf{n}^2}{2\mathbf{M}}\) gives \(1-\frac{\mathbf{n}^2}{2\mathbf{M}}\approx e^{-\frac{\mathbf{n}^2}{2\mathbf{M}}}\).
From \((2)\):
\(p\approx 1-e^{-\frac{\mathbf{n}(\mathbf{n}-1)}{2\mathbf{M}}}\approx 1-e^{-\frac{\mathbf{n}^2}{2\mathbf{M}}}\approx 1-(1-\frac{\mathbf{n}^2}{2\mathbf{M}})=\frac{\mathbf{n}^2}{2\mathbf{M}}\)
Therefore:
\[(3) p\approx \frac{\mathbf{n}^2}{2\mathbf{M}}\]
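
To see how the three formulas compare numerically, here is a small Python sketch; the hash size \(2^{32}\) and the values of \(n\) are arbitrary illustrations (a hypothetical 32-bit hash), and the function names are mine:

    import math

    def p_exact(n, M):    # formula (1)
        prod = 1.0
        for k in range(n):
            prod *= 1.0 - k / M
        return 1.0 - prod

    def p_approx(n, M):   # formula (2)
        return 1.0 - math.exp(-n * (n - 1) / (2.0 * M))

    def p_simple(n, M):   # formula (3), requires 1 << n^2 << 2M
        return n * n / (2.0 * M)

    M = 2 ** 32
    for n in (1000, 10000, 65536):
        # for n = 65536, n^2 is no longer much smaller than 2M,
        # so formula (3) visibly overshoots the exact value
        print(n, p_exact(n, M), p_approx(n, M), p_simple(n, M))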

Number of values for a probable collision

As before, let \(\mathbf{M}\) be the size of the range (e.g. for md5 it is \(2^{128}\), since md5 returns 128 bits).
Let \(\mathbf{p}\) be the desired probability for at least one collision.
We shall calculate \(n\), the number of values such that the probability of at least one collision is \(\mathbf{p}\).

From \((2)\):
\(\mathbf{p}\approx 1-e^{-\frac{n(n-1)}{2\mathbf{M}}}
\\
1-\mathbf{p}\approx e^{-\frac{n(n-1)}{2\mathbf{M}}}
\\
\log(1-\mathbf{p})\approx-\frac{n(n-1)}{2\mathbf{M}}
\\
n(n-1)\approx-2\mathbf{M}\log(1-\mathbf{p})=2\mathbf{M}\log \frac{1}{1-\mathbf{p}}
\\
n^2-n-2\mathbf{M}\log \frac{1}{1-\mathbf{p}}\approx0
\\
n\approx\frac{1+\sqrt{1+8\mathbf{M}\log \frac{1}{1-\mathbf{p}}}}{2}=\frac{1}{2}+\sqrt{\frac{1}{4}+2\mathbf{M}\log \frac{1}{1-\mathbf{p}}}\) (taking the positive root of the quadratic)
Hence:
\[(4) n\approx \frac{1}{2}+\sqrt{\frac{1}{4}+2\mathbf{M}\log \frac{1}{1-\mathbf{p}}}\]
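
Formula \((4)\) translates directly into code; here is a minimal Python sketch (the function name n_for_probability is mine):

    import math

    def n_for_probability(p, M):
        # Formula (4): number of values needed so that the probability
        # of at least one collision is approximately p.
        return 0.5 + math.sqrt(0.25 + 2.0 * M * math.log(1.0 / (1.0 - p)))

    print(n_for_probability(0.5, 365))  # ~23.0, matching the birthday problem below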

If \(\mathbf{M}\mathbf{p}\gg 1\), the constant terms \(\frac{1}{2}\) and \(\frac{1}{4}\) are negligible:
\[(5) n\approx\sqrt{2\mathbf{M}\log \frac{1}{1-\mathbf{p}}}\]

If in addition \(\mathbf{p}\ll 1\), formula \((3)\) implies (this also follows from \((5)\), since \(\log \frac{1}{1-\mathbf{p}}\approx\mathbf{p}\) for small \(\mathbf{p}\)):
\[(6) n\approx \sqrt{2\mathbf{M}\mathbf{p}}\]
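
The two simplified formulas agree closely once \(\mathbf{p}\) is small; a quick sketch (the 64-bit range is an arbitrary illustration, and the function names are mine):

    import math

    def n_simple(p, M):    # formula (5), requires M*p >> 1
        return math.sqrt(2.0 * M * math.log(1.0 / (1.0 - p)))

    def n_simpler(p, M):   # formula (6), requires M*p >> 1 and p << 1
        return math.sqrt(2.0 * M * p)

    M = 2 ** 64
    for p in (0.001, 0.01, 0.1, 0.5):
        print(p, n_simple(p, M), n_simpler(p, M))  # the gap grows as p grows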

Examples

Birthday problem

The birthday problem (often referred to as the birthday paradox) states that if \(23\) people are selected at random, the probability that at least two of them share a birthday is higher than \(\frac{1}{2}\).
In this case \(\mathbf{M}=365\), \(\mathbf{n}=23\) and \(\mathbf{p}=\frac{1}{2}\).

Using formula \((1)\):
\(p=1-\prod_{k=0}^{22}(1-\frac{k}{365})=0.507297\)

Luckily, formula \( (2)\) produces a similar result:
\(p\approx 1-e^{-\frac{23\times 22}{2\times 365}}=1-e^{-\frac{253}{365}}=0.500002\)

The inverse can be calculated using formula \( (4)\):
\(n\approx \frac{1}{2}+\sqrt{\frac{1}{4}+2\times 365\log \frac{1}{1-\frac{1}{2}}}=\frac{1}{2}+\sqrt{\frac{1}{4}+730\log 2}=22.9999\)

Formulas \((3)\), \((5)\) and \((6)\) are inappropriate in this case (though their results are not far from the actual ones).
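
The birthday numbers above can be reproduced with a few lines of Python:

    import math

    M, n = 365, 23

    # Formula (1): exact probability
    no_collision = 1.0
    for k in range(n):
        no_collision *= 1.0 - k / M
    print(1.0 - no_collision)                        # ~0.507297

    # Formula (2): approximation
    print(1.0 - math.exp(-n * (n - 1) / (2.0 * M)))  # ~0.500002

    # Formula (4): how many people are needed for p = 1/2
    print(0.5 + math.sqrt(0.25 + 2.0 * M * math.log(2.0)))  # ~23.0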

md5 hash function

In this case \(\mathbf{M}=2^{128}\), as mentioned at the beginning.

Using formula \((5)\) (\(\mathbf{M}\) is large enough that \(\mathbf{Mp}\gg 1\) for all the probabilities below), we can find \(n\) given \(\mathbf{p}\):
\(\mathbf{p}=\frac{1}{1000}\): \(n\approx\sqrt{2\times 2^{128} \log \frac{1}{1-\frac{1}{1000}}}=8.25170\times 10^{17}\)
\(\mathbf{p}=\frac{1}{100}\): \(n\approx\sqrt{2\times 2^{128} \log \frac{1}{1-\frac{1}{100}}}=2.61532\times 10^{18}\)
\(\mathbf{p}=\frac{1}{10}\): \(n\approx\sqrt{2\times 2^{128} \log \frac{1}{1-\frac{1}{10}}}=8.4678\times 10^{18}\)
\(\mathbf{p}=\frac{1}{2}\): \(n\approx\sqrt{2\times 2^{128} \log \frac{1}{1-\frac{1}{2}}}=2.17194\times 10^{19}\)
\(\mathbf{p}=\frac{3}{4}\): \(n\approx\sqrt{2\times 2^{128} \log \frac{1}{1-\frac{3}{4}}}=3.07158\times 10^{19}\)
\(\mathbf{p}=\frac{99}{100}\): \(n\approx\sqrt{2\times 2^{128} \log \frac{1}{1-\frac{99}{100}}}=5.59832\times 10^{19}\)

For \(\mathbf{p}=\frac{1}{1000}, \frac{1}{100}, \frac{1}{10}\) formula \((6)\) can be used (higher values of \(\mathbf{p}\) give poor results):
\(\mathbf{p}=\frac{1}{1000}\): \(n\approx\sqrt{2\times 2^{128} \frac{1}{1000}}=8.24963\times 10^{17}\)
\(\mathbf{p}=\frac{1}{100}\): \(n\approx\sqrt{2\times 2^{128} \frac{1}{100}}=2.60876\times 10^{18}\)
\(\mathbf{p}=\frac{1}{10}\): \(n\approx\sqrt{2\times 2^{128} \frac{1}{10}}=8.24963\times 10^{18}\)
These results are consistent with the previous results.
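
Both sets of md5 figures can be reproduced with a short loop; a sketch (double-precision floats carry only about 15-16 significant digits, which is plenty for these estimates):

    import math

    M = 2 ** 128

    for p in (1/1000, 1/100, 1/10, 1/2, 3/4, 99/100):
        n5 = math.sqrt(2 * M * math.log(1 / (1 - p)))  # formula (5)
        n6 = math.sqrt(2 * M * p)                      # formula (6), only meaningful for small p
        print(p, n5, n6)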

Summary

We have obtained some formulas that relate \(\mathbf{M}\), \(\mathbf{n}\) and \(\mathbf{p}\):
\[(1) p=1-\prod_{k=0}^{\mathbf{n}-1}(1-\frac{k}{\mathbf{M}})
\\
(2) p\approx 1-e^{-\frac{\mathbf{n}(\mathbf{n}-1)}{2\mathbf{M}}}
\\
(3) p\approx \frac{\mathbf{n}^2}{2\mathbf{M}} (1\ll \mathbf{n}^2\ll 2\mathbf{M})
\\
(4) n\approx \frac{1}{2}+\sqrt{\frac{1}{4}+2\mathbf{M}\log \frac{1}{1-\mathbf{p}}}
\\
(5) n\approx\sqrt{2\mathbf{M}\log \frac{1}{1-\mathbf{p}}} (\mathbf{M}\mathbf{p}\gg 1)
\\
(6) n\approx \sqrt{2\mathbf{M}\mathbf{p}} (\mathbf{Mp}\gg 1, \mathbf{p}\ll 1)\]

2 Responses to Hash Collision Probability

  1. SonOfLilit says:

    I felt bad about using a 1st order Taylor approximation of e^x.

    Graphing with WA: http://www.wolframalpha.com/input/?i=e^x+-+%281+%2B+x%29

    What is the relative error? http://www.wolframalpha.com/input/?i=e^x%2F%281+%2B+x%29

    But since we’re using -k/M as x, and k <= n, and n < M, we're probably near 0 from the negative side.

    Let's ask WA "when is the error acceptable?": http://www.wolframalpha.com/input/?i=0.5+%3C+e^x%2F%281+%2B+x%29+%3C+1.5

    It answers "between ~-0.6 and ~1.1". So we seem to be safe.

    All in all: Well played! Interesting read.

    • Thank you for the comment.

      Also remember that the relative error is the biggest on the last term of our product: 1-(n-1)/M.
      Usually n is on the order of Sqrt(M) or lower, so the approximation is used for |x| on the order of 1/Sqrt(M) or smaller. When M is big enough this number is tiny and the approximation is wonderful.
