After discussing with some friends on StartupSeeds, I came up with a generalized random sub array algorithm.
This algorithm can be implemented with different data-structures, and it’s complexity is $$O(n)$$ or $$O(k\log\space k)$$ for 2 specific implementations.

We need a data-structure that can contain integers.
It must have 2 methods which are add (adds an integer) and contains (whether an integer is in or not).
We also need to be able to iterate it, I will use foreach loop for that.

## The Algorithm

Let $$A$$ be the original array, with elements of type $$T$$.
Let $$k$$ be the number of elements we want in the sub array.
Let $$n=length(A)$$.
Let $$D$$ be the data-structure.

1. Let $$B=T[k]$$ (an array of type $$T$$ with $$k$$ elements)
2. Let $$flag=(k\le n / 2)$$
3. Let $$stop=(flag ? k : n-k)$$
4. $$for(i=0;i\lt stop;i=i+1)$$
1. $$while(true)$$
1. Let $$key=random(0,n-1)$$
2. $$if(!D.contains(key))$$
1. $$D.add(key)$$
2. $$break$$
5. $$if(flag)$$
1. Let $$p=0$$
2. $$foreach(key\space in\space D)$$
1. $$B[p]=A[key]$$
2. $$p=p+1$$
6. $$else$$
1. $$keys=bool[n]$$
2. $$foreach(key\space in\space D)$$
1. $$keys[key]=true$$
3. Let $$p=0$$
4. $$for(i=0;i\lt k;i=i+1)$$
1. $$while(keys[p])$$
1. $$p=p+1$$
2. $$B[i]=array[p]$$
3. $$p=p+1$$
7. $$return\space B$$

## Complexity Analysis

I will calculate each stage’s complexity separately, and then sum them up.
But first:
Let $$T_c(j)$$ be the complexity of $$D.contains$$ when there are $$j$$ elements in it.
Let $$T_a(j)$$ be the complexity of $$D.add$$ when there are $$j$$ elements in it.
Let $$T_i(j)$$ be the complexity of iterating the data-structure’s elements when there are $$j$$ elements in it.

Stage 1 is $$O(k)$$ or $$O(1)$$ depending on the language’s implementation for array declarations, we will use $$O(k)$$.

Stage 2 and 3 are clearly $$O(1)$$.

Stage 4 is more complex to calculate and we can only calculate it for the average case.
In each of the iterations in the for-loop we are looking for a number between $$0$$ and $$n-1$$ that is not already in $$D$$.
Because there are already $$i$$ elements in the data-structure, the distribution of the number of while-loop iterations is a geometric distribution with probability $$\frac{n-i}{n}$$, therefore the mean number of iterations is $$\frac{n}{n-i}$$.
So this stage’s complexity is:
$\displaylines{O(\sum_{i=0}^{min(k,n-k)-1}(T_c(i)\frac{n}{n-i}+T_a(i))) \\ =O(\sum_{i=0}^{min(k,n-k)-1}(T_c(i)\frac{n}{n/2}+T_a(i))) \\ =O(\sum_{i=0}^{min(k,n-k)-1}(T_c(i)+T_a(i)))}$

Stage 5 is simply $$O(T_i(k))$$.

Stage 6 is also quite simple.
There is an array deceleration ($$O(n)$$ as mentioned before), iteration over the data-structure $$O(T_i(n-k))$$ and an iteration on the arrays ($$O(n)$$.
Summing up to $$O(T_i(n-k)+n)$$.
But we know that $$k\ge n/2$$, and therefore it is also $$O(T_i(n-k)+k)$$.

Bear in mind that only one of stages 5 and 6 occurs, hence there total complexity of these stages is $$O(T_i(min(k,n-k))+k)$$.

Stage 7 is also $$O(1)$$.

Summing stages 1 to 7 gives us:
$O(\sum_{i=0}^{min(k,n-k)-1}(T_c(i)+T_a(i))+T_i(min(k,n-k))+k)$

## Implementation #1 – Boolean Array

This implementation was introduced in my previous article, Generating Random Sub Array.
In this implementation the data-structure acts the same as the array $$keys$$ that is used in stage 6.1.

In this case $$T_c(j)=1$$, $$T_a(j)=1$$ and $$T_i(j)=n$$.
With that in mind the algorithms complexity can be found easily: $$O(\sum_{i=0}^{min(k,n-k)-1}(1+1)+n+k)=O(n)$$

## Implementation #2 – Self-Balancing Binary Search Tree

In this implementation our data-structure is a binary search tree.
This implementation is much better than the one mentioned above when $$k\ll n$$.

In this case $$T_c(j)=\log\space j$$, $$T_a(j)=\log\space j$$ and $$T_i(j)=j$$.
We can again easily find the total time complexity.
For simplicity I will assume that $$k\le n/2$$.
$$O(\sum_{i=0}^{k-1}(\log\space j+\log\space j)+k+k)=O(k\log\space k)$$

Tagged with: