After discussing with some friends on StartupSeeds, I came up with a generalized random sub array algorithm.
This algorithm can be implemented with different data-structures, and it’s complexity is \(O(n)\) or \(O(k\log\space k)\) for 2 specific implementations.

We need a data-structure that can contain integers.
It must have 2 methods which are add (adds an integer) and contains (whether an integer is in or not).
We also need to be able to iterate it, I will use foreach loop for that.

The Algorithm

Let \(A\) be the original array, with elements of type \(T\).
Let \(k\) be the number of elements we want in the sub array.
Let \(n=length(A)\).
Let \(D\) be the data-structure.

  1. Let \(B=T[k]\) (an array of type \(T\) with \(k\) elements)
  2. Let \(flag=(k\le n / 2)\)
  3. Let \(stop=(flag ? k : n-k)\)
  4. \(for(i=0;i\lt stop;i=i+1)\)
    1. \(while(true)\)
      1. Let \(key=random(0,n-1)\)
      2. \(if(!D.contains(key))\)
        1. \(D.add(key)\)
        2. \(break\)
  5. \(if(flag)\)
    1. Let \(p=0\)
    2. \(foreach(key\space in\space D)\)
      1. \(B[p]=A[key]\)
      2. \(p=p+1\)
  6. \(else\)
    1. \(keys=bool[n]\)
    2. \(foreach(key\space in\space D)\)
      1. \(keys[key]=true\)
    3. Let \(p=0\)
    4. \(for(i=0;i\lt k;i=i+1)\)
      1. \(while(keys[p])\)
        1. \(p=p+1\)
      2. \(B[i]=array[p]\)
      3. \(p=p+1\)
  7. \(return\space B\)

Complexity Analysis

I will calculate each stage’s complexity separately, and then sum them up.
But first:
Let \(T_c(j)\) be the complexity of \(D.contains\) when there are \(j\) elements in it.
Let \(T_a(j)\) be the complexity of \(D.add\) when there are \(j\) elements in it.
Let \(T_i(j)\) be the complexity of iterating the data-structure’s elements when there are \(j\) elements in it.

Stage 1 is \(O(k)\) or \(O(1)\) depending on the language’s implementation for array declarations, we will use \(O(k)\).

Stage 2 and 3 are clearly \(O(1)\).

Stage 4 is more complex to calculate and we can only calculate it for the average case.
In each of the iterations in the for-loop we are looking for a number between \(0\) and \(n-1\) that is not already in \(D\).
Because there are already \(i\) elements in the data-structure, the distribution of the number of while-loop iterations is a geometric distribution with probability \(\frac{n-i}{n}\), therefore the mean number of iterations is \(\frac{n}{n-i}\).
So this stage’s complexity is:
\[\displaylines{O(\sum_{i=0}^{min(k,n-k)-1}(T_c(i)\frac{n}{n-i}+T_a(i)))
\\
=O(\sum_{i=0}^{min(k,n-k)-1}(T_c(i)\frac{n}{n/2}+T_a(i)))
\\
=O(\sum_{i=0}^{min(k,n-k)-1}(T_c(i)+T_a(i)))}\]

Stage 5 is simply \(O(T_i(k))\).

Stage 6 is also quite simple.
There is an array deceleration (\(O(n)\) as mentioned before), iteration over the data-structure \(O(T_i(n-k))\) and an iteration on the arrays (\(O(n)\).
Summing up to \(O(T_i(n-k)+n)\).
But we know that \(k\ge n/2\), and therefore it is also \(O(T_i(n-k)+k)\).

Bear in mind that only one of stages 5 and 6 occurs, hence there total complexity of these stages is \(O(T_i(min(k,n-k))+k)\).

Stage 7 is also \(O(1)\).

Summing stages 1 to 7 gives us:
\[O(\sum_{i=0}^{min(k,n-k)-1}(T_c(i)+T_a(i))+T_i(min(k,n-k))+k)\]

Implementation #1 – Boolean Array

This implementation was introduced in my previous article, Generating Random Sub Array.
In this implementation the data-structure acts the same as the array \(keys\) that is used in stage 6.1.

In this case \(T_c(j)=1\), \(T_a(j)=1\) and \(T_i(j)=n\).
With that in mind the algorithms complexity can be found easily: \(O(\sum_{i=0}^{min(k,n-k)-1}(1+1)+n+k)=O(n)\)

Implementation #2 – Self-Balancing Binary Search Tree

In this implementation our data-structure is a binary search tree.
This implementation is much better than the one mentioned above when \(k\ll n\).

In this case \(T_c(j)=\log\space j\), \(T_a(j)=\log\space j\) and \(T_i(j)=j\).
We can again easily find the total time complexity.
For simplicity I will assume that \(k\le n/2\).
\(O(\sum_{i=0}^{k-1}(\log\space j+\log\space j)+k+k)=O(k\log\space k)\)

Tagged with:
 

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

Set your Twitter account name in your settings to use the TwitterBar Section.