In this post I present a simpler proof to derive expected asymptotic bounds for the
Randomized Selection algorithm --
Las Vegas version to be more precise. Given a set $S = \{ a_1\,a_2\,a_3\ldots a_n \}$ of keys and an integer $k \in [1,n]$ the
selection problem asks to report the $k^{th}$ largest key in the set $S$. Recall that this problem can be solved optimally using $O(n)$ comparison operations using the BFPRT (Blum, Floyd, Pratt, Rivest and Tarjan 1973) algorithm. However a correct implementation of BFPRT algorithm may take a lot of code. On the other hand randomized algorithms for many deterministic algorithms are often simpler to implement. The algorithm chooses a random pivot for partitioning the input and recursively picks on of the partition until the partition (sub problem) size is just one. Following is an implementation of
RandSelect in around 25 lines.
Our goal here is the analyze the
expected run-time and
expected number of steps for the problem size converge to
unity -- more precisely the expected value of the variable
rand_walk_steps in the following code. I have purposefully kept the analysis very rigorous to make it accessible to everyone.
TL;DR "Expected number of comparisons in
RandSelect is $\Theta(n)$ and expected number of
rand_walk_steps is $\Theta(\log(n))$".
// Randomized selection algorithm to pick k^th largest element
// in a mutable array.
template < typename T >
T RandSelect(T *array, size_t n, size_t k) {
assert(k > 0 && k <= n);
size_t problem_size = n, select = k;
// What is the expected value of rand_walk_steps ??
size_t rand_walk_steps = 0;
while (problem_size > 1) {
size_t i = 0, j = problem_size - 1;
size_t ridx = RandArrayIndex(problem_size);
const T pivot = array[ridx];
// Partition the sub problem with pivot.
while (i <= j) {
if (array[i] <= pivot) {
i++;
} else if (array[j] > pivot) {
j--;
} else {
std::swap(array[i], array[j]);
}
}
problem_size = (i >= select) ? i : problem_size - i;
array = (i >= select) ? array : &array[i];
select = (i >= select) ? select : select - i;
++rand_walk_steps;
}
return array[0];
}
Let $X_i$ be the random variable corresponding to problem size in $i^{th}$ step ($1 \geq i \geq n-1$) in the algorithm -- each step corresponds to one iteration of the outer most
while loop. If the algorithm terminates in $j$ steps then all $X_k$ with $k>j$ are set to $0$.
$$
\begin{array}{lll}
X_i &=& \text{Random variable for problem size in }i^{th}\text{ iteration of the outer most}\textsf{ while}\text{ loop} \\
X_i && \left\{ \begin{array}{lr} \in [1,X_{i-1}-1] & \textsf{If } X_{i-1} > 1 \\
= 0 & \textsf{Otherwise}
\end{array} \right. \\
X_0 &=& n \,\,\,\,\,\,\, \text{Initial problem size and }\,\,\, E[X_0] = n \\
&& \\
\end{array}
$$
Lemma-1: If $X_{i-1} >1$ Then $E[X_i | X_{i-1}] = \frac{X_{i-1}}{2}$
Proof: For a given $X_{i-1} > 1$ the random variable $X_{i}$ is uniformly distributed in the range $[1,X_{i-1}-1]$. Hence the expected value is simply the average $\frac{1 + (X_{i-1}-1)}{2} = \frac{X_{i-1}}{2}$ $\Box$.
Lemma-2: $E[X_i] = \frac{n}{2^i}$
Proof:
$$
\begin{array}{llll}
&E[X_i|X_{i-1}] &=& \frac{X_{i-1}}{2}\,\,\,\,\,\text{From Lemma-1}\\
&E[\,E[X_i|X_{i-1}]\,] &=& E[\frac{X_{i-1}}{2}] \\
\Rightarrow & E[X_i] &=& E[\frac{X_{i-1}}{2}] = \frac{E[X_{i-1}]}{2}\,\,\,\,\,\text{for any two random variables } X,Y\,\,E[E[X|Y]] = E[X]\\
&&=&\frac{1}{2}\times E[X_{i-1}] = \frac{1}{2^2}\times E[X_{i-2}] \ldots \,\,\,\,\, \text{Apply recursively} \\
&&\ldots& \\
&&=& \frac{1}{2^i}\times E[X_{i-i}] = \frac{E[X_0]}{2^i} = \frac{n}{2^i} \,\, \Box\\
\end{array}
$$
Theorem-1: Expected number of comparisons done by RandSelect is $\Theta(n)$.
Proof: The number comparisons done by the algorithm in each iteration is $\leq 3\times X_i$. We at most $2$ comparisons in the inner if..else branch and $1$ comparison to test $i <= j$, so a total of $3$ comparisons in the each iteration of the inner most while loop. The loop runs at most $X_i$ times in the $i^{th}$ iteration of the outer most while loop. Finally we need $1$ comparison to test the termination of the while loop. Let $T$ be the random variable corresponding the total number of comparisons in the algorithm. So we need to find $E[T]$.
$$
\begin{array}{ll}
&T \leq 3(X_0 + X_1 + X_2 + \ldots X_{n-1}) + 1 & \text{Total number of comparisons in the Algorithm. }\\
&&\\
\Rightarrow& \sum_{i=0}^{n-1} X_i \leq T \leq 3\sum_{i=0}^{n-1} X_{i} + 1 & \\
& \sum_{i=0}^{n-1} E[X_i] \leq E[T] \leq 3\sum_{i=0}^{n-1} E[X_{i}] + 1 & \text{Linearity of expectation.} \\
&&\\
& \sum_{i=0}^{n-1} \frac{n}{2^i} \leq E[T] \leq 3\sum_{i=0}^{n-1} \frac{n}{2^i} + n & \text{Apply Lemma-2.} \\
& 2n\left(1-\frac{1}{2^n}\right) \leq E[T] \leq 6n\left(1-\frac{1}{2^n}\right) +n & \text{Use: } \sum_{i=0}^{n-1}\frac{1}{2^i} = \frac{1-1/2^n}{1-1/2}.\\
&&\\
\Rightarrow &n \leq E[T] < 6n + 1 = 6n +1 & \text{Use: for } n\geq 1\,\,\,,\frac{1}{2} \leq \left(1-\frac{1}{2^n}\right) < 1 \\
&&\\
\Rightarrow & E[T] = \Theta(n)\,\,\Box & \\
\end{array}
$$
We now return the question of the expected number of steps for the problem size to reach unity (i.e. expected number of iterations of the outer most
while loop). This is also referred as the expected number of
recursive calls in Motwani and Raghavan's book. However the following analysis is much simpler than the application of a calculus based lemma based on random walks in the book. Let $Y_{i}$ be a random variable corresponding to the number of iterations starting with a problem size of $X_{i} > 1$.
$$
\begin{array}{lll}
Y_{i} &:=&\text{random variable for the number of steps with problem size }\,\, X_{i} >1 \\
Y_i &=& 1 + Y_{i+1} \,\,\,\,\,\, \text{(Recursive formulation)}\\
Y = Y_0 &=& \underbrace{1 + 1 + 1 \dots + 1}_{k \text{-times}} + Y_{k} \,\,\,\,\,\, \text{(Expand } k \text{ times)}\\
&& \\
\end{array}
$$
Theorem-2: Expected number of steps for the problem size to reach unity, $E[Y] = \Theta(\log(n))$.
Proof:
$$
\begin{array}{ll}
&Y_i \leq X_{i}\,\,\,\, \text{ (Number of steps are at most problem size at } i^{th} \text{ step which is } X_i)\\
& \\
\Rightarrow &E[Y_i] \leq E[X_i] \,\,\,\, \text{( Let } Z = X_i-Y_i\,\,\,\,,\,\,\,\, E[Z] = E[X_i]-E[Y_i] \geq 0 \text{)}\\
&E[Y_i] \leq E[X_i] = \frac{n}{2^i} \,\,\,\, \text{ (Apply Lemma-2)} \\
&Y = Y_0 = \underbrace{1 + 1 + 1 \dots + 1}_{t \text{-times}} + Y_{t} \\
\Rightarrow & E[Y] = \underbrace{E[1] + E[1] \dots + E[1]}_{t \text{-times}} + E[Y_{t}] = t + E[Y_{t}] \\
&E[Y] = t + E[Y_{t}] \leq t + E[X_{t}] \leq t + \frac{n}{2^t} \\
&\\
\Rightarrow &t \leq E[Y] \leq t + \frac{n}{2^t} \,\,\,\, \text{After } t \text{ steps }\\
& \\
&\text{When } t = \log_2(n) \text{ the above inequality becomes the following } \\
& \log_2(n) \leq E[Y] \leq \log_2(n) + \frac{n}{2^{log_2(n)}} = \log_2(n) +1 \\
\Rightarrow & E[Y] =\Theta(log(n)) \,\, \Box \\
\end{array}
$$
This completes our expected analysis of the
RandSelect algorithm. The analysis just uses elementary probability theory without relying on a Calculus based random walk theorem in the
"Randomized Algorithms" book by Motwani and Raghavan -- (
Theorem-1.3 on page-15).