One of the problems of hash functions is collisions, they cause some security vulnerabilities which won’t be discussed in this article.
In this article I would like to find some approximations concerning collisions.

## Probability for a collision

Let $$\mathbf{M}$$ be the size of the range (i.e. for md5 it is $$2^{128}$$ since md5 returns 128 bits).
Let $$\mathbf{n}$$ be the number of  random values in this range.
We shall calculate $$p$$, the probability for at least one collision.

First, we calculate $$\overline{p}$$, the probability for no collisions:
The first value can be any one from $$\mathbf{M}$$, the second value can be any one from $$\mathbf{M}-1$$ different values, and so on.
Therefore, $$\overline{p}=\frac{\mathbf{M}}{\mathbf{M}}\frac{\mathbf{M}-1}{\mathbf{M}}…\frac{\mathbf{M}-(\mathbf{n}-1)}{\mathbf{M}}=\prod_{k=0}^{\mathbf{n}-1}\frac{\mathbf{M}-k}{\mathbf{M}}=\prod_{k=0}^{\mathbf{n}-1}(1-\frac{k}{\mathbf{M}})$$.

Since $$p=1-\overline{p}$$ it is clear that:
$(1) p=1-\prod_{k=0}^{\mathbf{n}-1}(1-\frac{k}{\mathbf{M}})$

Recall that $$e^x=\sum_{k=0}^{\infty}\frac{x^k}{k!}\approx \frac{x^0}{0!}+\frac{x^1}{1!}=1+x$$, substituting $$x=-\frac{k}{\mathbf{M}}$$ gives $$1-\frac{k}{\mathbf{M}}\approx e^{-\frac{k}{\mathbf{M}}}$$.
Now, from $$(1)$$:
$$p\approx 1-\prod_{k=0}^{\mathbf{n}-1}e^{-\frac{k}{\mathbf{M}}}=1-e^{\sum_{k=0}^{\mathbf{n}-1}{-\frac{k}{\mathbf{M}}}}=1-e^{-\frac{1}{\mathbf{M}} \sum_{k=0}^{\mathbf{n}-1}k}=1-e^{-\frac{\mathbf{n}(\mathbf{n}-1)}{2\mathbf{M}}}$$
Thus:
$(2) p\approx 1-e^{-\frac{\mathbf{n}(\mathbf{n}-1)}{2\mathbf{M}}}$

Under the condition $$1\ll \mathbf{n}^2\ll 2\mathbf{M}$$ an even simpler formula emerges.
Again, $$e^x\approx 1+x$$, substituting $$x=-\frac{\mathbf{n}^2}{2\mathbf{M}}$$ gives $$1–\frac{\mathbf{n}^2}{2\mathbf{M}}\approx e^{-\frac{\mathbf{n}^2}{2\mathbf{M}}}$$.
From $$(2)$$:
$$p\approx 1-e^{-\frac{\mathbf{n}(\mathbf{n}-1)}{2\mathbf{M}}}\approx 1-e^{-\frac{\mathbf{n}^2}{2\mathbf{M}}}\approx 1-(1-\frac{\mathbf{n}^2}{2\mathbf{M}})=\frac{\mathbf{n}^2}{2\mathbf{M}}$$
Therefore:
$(3) p\approx \frac{\mathbf{n}^2}{2\mathbf{M}}$

## Amount of values for a probable collision

Let $$\mathbf{M}$$ be the size of the range (i.e. for md5 it is $$2^{128}$$ since md5 returns 128 bits).
Let $$\mathbf{p}$$ be the desired probability for at least one collision.
We shall calculate $$n$$, the number of values such that the probability for a collision is $$p$$.

From $$(2)$$:
$$\mathbf{p}\approx 1-e^{-\frac{n(n-1)}{2\mathbf{M}}} \\ 1-\mathbf{p}\approx e^{-\frac{n(n-1)}{2\mathbf{M}}} \\ \log(1-\mathbf{p})\approx-\frac{n(n-1)}{2\mathbf{M}} \\ n(n-1)\approx-2\mathbf{M}\log(1-\mathbf{p})=2\mathbf{M}\log \frac{1}{1-\mathbf{p}} \\ n^2-n-2\mathbf{M}\log \frac{1}{1-\mathbf{p}}\approx0 \\ n\approx\frac{1+\sqrt{1-8\mathbf{M}\log \frac{1}{1-\mathbf{p}}}}{2}=\frac{1}{2}+\sqrt{\frac{1}{4}+2\mathbf{M}\log \frac{1}{1-\mathbf{p}}}$$
Hence:
$(4) n\approx \frac{1}{2}+\sqrt{\frac{1}{4}+2\mathbf{M}\log \frac{1}{1-\mathbf{p}}}$

If $$\mathbf{M}\mathbf{p}\gg 1$$ the constants are negligible:
$(5) n\approx\sqrt{2\mathbf{M}\log \frac{1}{1-\mathbf{p}}}$

If in addition $$\mathbf{p}\ll 1$$, formula $$(3)$$ implies (can be demonstrated using $$(5)$$ too):
$(6) n\approx \sqrt{2\mathbf{M}\mathbf{p}}$

## Examples

### Birthday problem

The birthday problem (often referred as birthday paradox) states that if $$23$$ people are randomly selected, the probability that two of them share their birthdays is higher than $$\frac{1}{2}$$.
In this case $$\mathbf{M}=365$$, $$\mathbf{n}=23$$ and $$\mathbf{p}=\frac{1}{2}$$.

Using formula$$(1)$$:
$$p=1-\prod_{k=0}^{22}(1-\frac{k}{365})=0.507297$$

Luckily, formula $$(2)$$ produces a similar result:
$$p\approx 1-e^{-\frac{23\times 22}{2\times 365}}=1-e^{-\frac{253}{365}}=0.500002$$

The inverse can be calculated using formula $$(4)$$:
$$n\approx \frac{1}{2}+\sqrt{\frac{1}{4}+2\times 365\log \frac{1}{1-\frac{1}{2}}}=\frac{1}{2}+\sqrt{\frac{1}{4}+730\log 2}=22.9999$$

Formulas $$(3)$$, $$(5)$$ and $$(6)$$ are inappropriate in this case (however their results are not too far from the actual results).

### md5 hash function

In this case $$\mathbf{M}=2^{128}$$ as I have mentioned in the beginning.

Using formula $$(5)$$ ($$\mathbf{M}$$ is very large) we can find $$n$$ given $$\mathbf{p}$$:
$$\mathbf{p}=\frac{1}{1000}$$: $$n\approx\sqrt{2\times 2^{128} \log \frac{1}{1-\frac{1}{1000}}}=8.25170\times 10^{17}$$
$$\mathbf{p}=\frac{1}{100}$$: $$n\approx\sqrt{2\times 2^{128} \log \frac{1}{1-\frac{1}{100}}}=2.61532\times 10^{18}$$
$$\mathbf{p}=\frac{1}{10}$$: $$n\approx\sqrt{2\times 2^{128} \log \frac{1}{1-\frac{1}{10}}}=8.4678\times 10^{18}$$
$$\mathbf{p}=\frac{1}{2}$$: $$n\approx\sqrt{2\times 2^{128} \log \frac{1}{1-\frac{1}{2}}}=2.17194\times 10^{19}$$
$$\mathbf{p}=\frac{3}{4}$$: $$n\approx\sqrt{2\times 2^{128} \log \frac{1}{1-\frac{3}{4}}}=3.07158\times 10^{19}$$
$$\mathbf{p}=\frac{99}{100}$$: $$n\approx\sqrt{2\times 2^{128} \log \frac{1}{1-\frac{99}{100}}}=5.59832\times 10^{19}$$

For $$\mathbf{p}=\frac{1}{1000}, \frac{1}{100}, \frac{1}{10}$$ formula $$(6)$$ can be used (higher values give bad results):
$$\mathbf{p}=\frac{1}{1000}$$: $$n\approx\sqrt{2\times 2^{128} \frac{1}{1000}}=8.24963\times 10^{17}$$
$$\mathbf{p}=\frac{1}{100}$$: $$n\approx\sqrt{2\times 2^{128} \frac{1}{100}}=2.60876\times 10^{18}$$
$$\mathbf{p}=\frac{1}{10}$$: $$n\approx\sqrt{2\times 2^{128} \frac{1}{10}}=8.24963\times 10^{18}$$
These results are consistent with the previous results.

## Summary

We have obtained some formulas that relate $$\mathbf{M}$$, $$\mathbf{n}$$ and $$\mathbf{p}$$:
$(1) p=1-\prod_{k=0}^{\mathbf{n}-1}(1-\frac{k}{\mathbf{M}}) \\ (2) p\approx 1-e^{-\frac{\mathbf{n}(\mathbf{n}-1)}{2\mathbf{M}}} \\ (3) p\approx \frac{\mathbf{n}^2}{2\mathbf{M}} (1\ll \mathbf{n}^2\ll 2\mathbf{M}) \\ (4) n\approx \frac{1}{2}+\sqrt{\frac{1}{4}+2\mathbf{M}\log \frac{1}{1-\mathbf{p}}} \\ (5) n\approx\sqrt{2\mathbf{M}\log \frac{1}{1-\mathbf{p}}} (\mathbf{M}\mathbf{p}\gg 1) \\ (6) n\approx \sqrt{2\mathbf{M}\mathbf{p}} (\mathbf{Mp}\gg 1, \mathbf{p}\ll 1)$

Tagged with:

### 2 Responses to Hash Collision Probability

1. SonOfLilit says:

I felt bad with using a 1st order taylor approximation of e^x.

Graphing with WA: http://www.wolframalpha.com/input/?i=e^x+-+%281+%2B+x%29

What is the relative error? http://www.wolframalpha.com/input/?i=e^x%2F%281+%2B+x%29

But since we’re using -k/M as x, and k <= n, and n < M, we're probably near 0 from the negative side.

Lets ask WA "when is the error acceptable?": http://www.wolframalpha.com/input/?i=0.5+%3C+e^x%2F%281+%2B+x%29+%3C+1.5

It answers "between ~-0.6 and ~1.1". So we seem to be safe.

All in all: Well played! Interesting read.

• Thank you for the comment.

Also remember that the relative is the biggest on the last term of our product: 1-n/M.
Usually n is on the order of Sqrt(M) or lower, so the approximation is for |x| When M is big enough this number is tiny and the approximation is wonderful.