Tuesday, July 11, 2017

A matrix formulation of extended Euclid algorithm.

I have written about Extended Euclid Algorithm in the past. However more recently while solving a problem I had to express it in a matrix form which I thought is something worthwhile.

Recall the problem statement: Given two non-zero integers $a,b$ (assume $a\geq b$), the problem is to determine integers $x,y$ such that $\mbox{gcd}(a,b) = a\times x + b \times y$. Notice that the original Euclid's algorithm can compute the gcd but not the coefficients $x,y$ whose linear combination yields the gcd. The idea here is to use the quotient instead of the remainder as in the original algorithm. We express the operands in each iteration a linear combination of the original inputs and use the following to update as follows. The quotient in $i^{th}$ iteration is denoted by $q_i$ leading to the following formulation.


\mbox{gcd}(a,b) = \mbox{ExGcd}(\left[\begin{array}{c} a \\ b \end{array}\right]) &\\
\left [
x_{0,0} & y_{0,0}\\
x_{1,0} & y_{1,0}
\right ] = \left [ \begin{array}{cc}1 & 0\\ 0 & 1\end{array} \right ] & \mbox{initialization } i=0\\

\mbox{ExGcd}(\left [\begin{array}{cc} a & b \end{array}\right ] \left [ \begin{array}{cc} x_{0,i} & y_{0,i} \\ x_{1.i} & y_{1,i} \end{array} \right]) = \mbox{ExGcd}(\left[\begin{array}{cc} a & b \end{array}\right]
\left [
x_{0,i-1} & y_{0,i-1} \\
x_{1,i-1} & y_{1,i-1}
\right ]
\left [
0 & 1 \\
1 & -q_{i-1}
\right ]) & i > 0 \\

q_i = \mbox{Quotient}(x_{0,i}a+x_{1,i}b, y_{0,i}a+y_{1,i}b) & \\

Following is an implementation using std::div:
// Input: a,b
// Output: x, y (y0,y1)
void ExtendedEuclid(long a, long b, long& y0, long& y1)
 long x0, x1, t0, t1;

 if (a < b) { std::swap(a, b); }
 // Initialize:
 x0 = 1; x1 = 0; // a = 1 * a + 0 * b
 y0 = 0; y1 = 1; // b = 0 * a + 1 * b
 std::ldiv_t div_res = std::div(a, b);
 while (div_res.rem) {
  t0 = y0; t1 = y1;

  y0 = x0 - (div_res.quot * y0);
  y1 = x1 - (div_res.quot * y1);
  x0 = t0; x1 = t1;

  // Invariant: (x0*a+x1*b >= y0*a+y1*b)
  div_res = std::div(x0*a+x1*b, y0*a+y1*b);

Following are few examples obtained by calling ExtendedEuclid(a,b,x,y):
a=25375,b=667 x=1, y=-38, gcd=29 
29 = 25375 * 1 + 667*38

a=17765,b=13481 x=-343, y=452, gcd=17
17 = 17765 * -343 + 13481 * 452

a=19109,b=18321 x=-23, y=24, gcd=197
197 = 19109 * -23 + 18321 * 24

Sunday, May 28, 2017

Efficient polygon area computation with less multiplications.

Linear algebra computations typically try to prefer mathematical formulations which preserve numerical precision, reduce multiplications and avoid divisions if possible. More recently I came across an elegant formulation of polygon area computation with half the number of multiplications relative to the standard method. I will try to briefly derive it for posterity.

Standard Formulation: Let $P = [v_0, v_1, v_2, v_3\ldots v_{n-1}, v_{n}=v_0]$ be a polygon with $n$ vertices, $v_i = (x_i,y_i)$. Recall the standard formula for the area of the polygon $A(P) = \sum_{i=0}^{n-1} (x_{i}y_{i+1} - y_{i}x_{i+1})$. Every term $T(i) = (x_{i}y_{i+1} - y_{i}x_{i+1})$ corresponds to the edge $(v_i, v_{i+1})$ of the polygon -- more precisely this is the signed area of the triangle ${(0,0), (x_i,y_i), (x_{i+1}, y_{i+1})}$. Clearly computation of $A(P) = \sum_{i=0}^{n-1} T(i)$ needs exactly $2n$ multiplication operations.

Alternate Formulation: Consider $T(i) = x_{i}y_{i+1} - y_{i}x_{i+1}$, now add $x_iy_i$ and subtract $x_{i+1}y_{i+1}$ resulting in the following:
D(i) &=& x_iy_i + T(i) - x_{i+1}y_{i+1} \\
D(i) + D(i+1) &=& x_iy_i + T(i) - x_{i+1}y_{i+1} + x_{i+1}y_{i+1} + T(i+1) - x_{i+2}y_{i+2} \\
&=& x_iy_i + T(i) + T(i+1) - x_{i+2}y_{i+2}\\
\sum_{i=0}^{n-1}D(i) &=& x_{0}y_{0} + \sum_{i=0}^{n-1} T(i) - x_ny_n\,\, \mbox{notice that}\, (x_0,y_0) = (x_n,y_n)\\
&=& \sum_{i=0}^{n-1} T(i)

Whats special about $D(i) = x_iy_i + T(i) - x_{i+1}y_{i+1}$ ?. Notice that $D(i)$ can be written as the following:
D(i) &=& x_iy_i + T(i) - x_{i+1}y_{i+1} \\
&=& x_iy_i + (x_iy_{i+1} - x_{i+1}y_i) - x_{i+1}y_{i+1} \\
&=& (x_i + x_{i+1})(y_i - y_{i+1}) \,\, \mbox{notice the only one multiplication}.

Finally we state that the area of the polygon is equivalent to $A(P) = \sum_{i=0}^{n-1} (x_i+x_{i+1})(y_i - y_{i+1})$ and needs exactly $n$ multiplications instead of $2n$ with standard formulation.

Friday, January 20, 2017

An Efficient Algorithm to Determine the Order of a Permutation.

Consider at set $S = \{1,2,3\ldots n\}$ and recall that $S_n$ a set of all permutations of $S$ form a group under composition (<$S_n, *$>). E.g. $S=\{1,2,3,4\}$ , $\pi_1 = [3\,4\,2\,1]$ and $\pi_2 = [4\,1\,3\,2]$. To understand composition operation it may be easier to look at a permutation $\pi \in S_n$ as bijective function $\pi : S \rightarrow S$. The composition $\pi_1 * \pi_2$ would be equivalent to $\pi_1(\pi_2(i)), \forall i\in S$ which results in $\pi_1 * \pi_2 = [1\,3\,2\,4] $. We use the notation $\pi^k = \pi*\pi*\pi\ldots *\pi$ to denote a composition applied $k$ times. Since <$S_n,*$> is a group it must have an identity element $e$ which is a permutation such that $e(i) = i, \forall i\in S$.

Now lets focus on the following question:
Given any permutation $\pi \in S_n$ is there an integer $k \geq 0$ such that $\pi^k = e$ ?

Lets try few examples with $\pi_1 = [3\, 4\, 2\, 1]$, $\pi_1^2 = [2\, 1\, 4\, 3]$, $\pi_1^3 = [4\, 3\, 1\, 2]$, $\pi_1^4 = [1\, 2\, 3\, 4]$. So for this example $\pi_1$ we were able to find $k=4$ such that $\pi_1^4 = e$. Later we will show that finding such a $k$ is always possible for any permutation $\pi$ and such a $k$ is called the order of the permutation.

Recall that every permutation can be decomposed into one or more disjoint cycles. We call a permutation cyclic if it has exactly one cycle in it. E.g. $\pi_c = [3\, 1\, 4\, 2]$ is a cyclic permutation in $S_4$ with cycle being $1 \rightarrow 3 \rightarrow 3 \rightarrow 4 \rightarrow 2 \rightarrow 1$.

Lemma: If $\sigma \in S_n$ is a cyclic permutation then $\sigma^n = e$

Proof: Since $\sigma$ is cyclic all elements $S$ form a directed cycle with $n$ nodes. Starting at any node $i$ we need to traverse $n$ edges to get back to $i$. This can be algebraically expressed as $\underbrace{\sigma(\sigma(\ldots \sigma(i))\ldots)}_{n\text{-times}} =i$

The above lemma states that if a permutation is cyclic on $n$ elements then composing it $n$ times on itself results in an identity permutation. Now we are ready to prove our main theorem:

Theorem: $\forall \pi \in S_n$ there exists a $k \geq 0$ such that $\pi^k = e$.

Proof: Let $\pi = \sigma_1\sigma_2\ldots \sigma_m$ be the cycle decomposition of $\pi$. Let $|\sigma_i|$ denote size of cycle. Since each of $\sigma_i$ is cyclic permutation in $S_{|\sigma_i|}$ by the previous lemma $\sigma_i^{|\sigma_i|} = e_{|\sigma_i|}$. Where $e_{|\sigma_i|}$ is an identity permutation in $S_{|\sigma_i|}$. Also notice that any multiple of $|\sigma_i|$ is also going to result in an identity permutation (i.e. for any integer $q \geq 1$ , $\sigma_i^{q|\sigma_i|} = e_{|\sigma_i|}$). Let $t = |\sigma_1|\times |\sigma_2|\times \ldots |\sigma_m|$ then $\pi^t = \sigma_1^t\sigma_2^t\ldots \sigma_m^t$ since $t$ is a multiple of $|\sigma_i|, 1\leq i \leq m$ we have $\pi^t = e_{|\sigma_1|}e_{|\sigma_2|}\ldots e_{|\sigma_m|} = e$

From the above proof its clear that choosing $k = |\sigma_1|\times |\sigma_2|\ldots |\sigma_m|$ which is a multiple of all cycle lengths guarantees that $\pi^k = e$. However from a computational efficiency purpose we are much better if we let that multiple be the least common multiple of $|\sigma_i|,\, 1\leq i\leq m$. Following is an efficient algorithm -- linear time and takes at most $n$ l.c.m computations -- to find the order of any permutation:

// Finds the order of a valid permutation ( 1 \leq perm[i] \leq n).
size_t FindPermutationOrder(const std::vector< unsigned int > &perm) {
  size_t n = perm.size();
  size_t perm_order = 1; // This overflows if the l.c.m does not fit in 64-bits.
  std::vector< bool > cycle_mask(n, true);

  // Find the disjoint cycle lengths in the permutation.
  size_t cycle_index = 0;
  while (cycle_index < n) {
    // Find the start of the next disjoint cycle.
    if (cycle_mask[cycle_index]) {
      size_t i = cycle_index;
      size_t cycle_length = 0;
      do {
        cycle_mask[i] = false;
        i = perm[i] - 1;
      } while (i != cycle_index);
      perm_order = boost::math::lcm(perm_order, cycle_length);
  return perm_order;
Following are few examples on the algorithm:
Order of [  5  3  1  2  4  ] is 5
Order of [  3  4  2  1  ] is 4
Order of [  1  2  3  4  5  ] is 1
Order of [  7  8  9  10  1  2  4  3  6  5  ] is 5

Thursday, December 22, 2016

Simpler Expected Analysis of Randomized Selection Algorithm.

In this post I present a simpler proof to derive expected asymptotic bounds for the Randomized Selection algorithm -- Las Vegas version to be more precise. Given a set $S = \{ a_1\,a_2\,a_3\ldots a_n \}$ of keys and an integer $k \in [1,n]$ the selection problem asks to report the $k^{th}$ largest key in the set $S$. Recall that this problem can be solved optimally using $O(n)$ comparison operations using the BFPRT (Blum, Floyd, Pratt, Rivest and Tarjan 1973) algorithm. However a correct implementation of BFPRT algorithm may take a lot of code. On the other hand randomized algorithms for many deterministic algorithms are often simpler to implement. The algorithm chooses a random pivot for partitioning the input and recursively picks on of the partition until the partition (sub problem) size is just one. Following is an implementation of RandSelect in around 25 lines.

Our goal here is the analyze the expected run-time and expected number of steps for the problem size converge to unity -- more precisely the expected value of the variable rand_walk_steps in the following code. I have purposefully kept the analysis very rigorous to make it accessible to everyone.

TL;DR "Expected number of comparisons in RandSelect is $\Theta(n)$ and expected number of rand_walk_steps is $\Theta(\log(n))$".

// Randomized selection algorithm to pick k^th largest element
// in a mutable array.
template < typename T > 
T RandSelect(T *array, size_t n, size_t k) {
  assert(k > 0 && k <= n); 
  size_t problem_size = n, select = k;

  // What is the expected value of rand_walk_steps ??
  size_t rand_walk_steps = 0;  

  while (problem_size > 1) {
    size_t i = 0, j = problem_size - 1;
    size_t ridx = RandArrayIndex(problem_size);
    const T pivot = array[ridx];

    // Partition the sub problem with pivot.
    while (i <= j) {
      if (array[i] <= pivot) {
      } else if (array[j] > pivot) {
      } else {
        std::swap(array[i], array[j]);

    problem_size = (i >= select) ? i : problem_size - i;
    array = (i >= select) ? array : &array[i];
    select = (i >= select) ? select : select - i;
  return array[0];

Let $X_i$ be the random variable corresponding to problem size in $i^{th}$ step ($1 \geq i \geq n-1$) in the algorithm -- each step corresponds to one iteration of the outer most while loop. If the algorithm terminates in $j$ steps then all $X_k$ with $k>j$ are set to $0$.
X_i &=& \text{Random variable for problem size in }i^{th}\text{ iteration of the outer most}\textsf{ while}\text{ loop} \\
X_i && \left\{ \begin{array}{lr} \in [1,X_{i-1}-1] & \textsf{If } X_{i-1} > 1 \\
= 0 & \textsf{Otherwise}
\end{array} \right. \\
X_0 &=& n \,\,\,\,\,\,\, \text{Initial problem size and }\,\,\, E[X_0] = n \\
&& \\

Lemma-1: If $X_{i-1} >1$ Then $E[X_i | X_{i-1}] = \frac{X_{i-1}}{2}$

Proof: For a given $X_{i-1} > 1$ the random variable $X_{i}$ is uniformly distributed in the range $[1,X_{i-1}-1]$. Hence the expected value is simply the average $\frac{1 + (X_{i-1}-1)}{2} = \frac{X_{i-1}}{2}$ $\Box$.

Lemma-2: $E[X_i] = \frac{n}{2^i}$

&E[X_i|X_{i-1}] &=& \frac{X_{i-1}}{2}\,\,\,\,\,\text{From Lemma-1}\\
&E[\,E[X_i|X_{i-1}]\,] &=& E[\frac{X_{i-1}}{2}] \\
\Rightarrow & E[X_i] &=& E[\frac{X_{i-1}}{2}] = \frac{E[X_{i-1}]}{2}\,\,\,\,\,\text{for any two random variables } X,Y\,\,E[E[X|Y]] = E[X]\\
&&=&\frac{1}{2}\times E[X_{i-1}] = \frac{1}{2^2}\times E[X_{i-2}] \ldots \,\,\,\,\, \text{Apply recursively} \\
&&\ldots& \\
&&=& \frac{1}{2^i}\times E[X_{i-i}] = \frac{E[X_0]}{2^i} = \frac{n}{2^i} \,\, \Box\\

Theorem-1: Expected number of comparisons done by RandSelect is $\Theta(n)$.

Proof: The number comparisons done by the algorithm in each iteration is $\leq 3\times X_i$. We at most $2$ comparisons in the inner if..else branch and $1$ comparison to test $i <= j$, so a total of $3$ comparisons in the each iteration of the inner most while loop. The loop runs at most $X_i$ times in the $i^{th}$ iteration of the outer most while loop. Finally we need $1$ comparison to test the termination of the while loop. Let $T$ be the random variable corresponding the total number of comparisons in the algorithm. So we need to find $E[T]$.
&T \leq 3(X_0 + X_1 + X_2 + \ldots X_{n-1}) + 1 & \text{Total number of comparisons in the Algorithm. }\\
\Rightarrow& \sum_{i=0}^{n-1} X_i \leq T \leq 3\sum_{i=0}^{n-1} X_{i} + 1 & \\
& \sum_{i=0}^{n-1} E[X_i] \leq E[T] \leq 3\sum_{i=0}^{n-1} E[X_{i}] + 1 & \text{Linearity of expectation.} \\
& \sum_{i=0}^{n-1} \frac{n}{2^i} \leq E[T] \leq 3\sum_{i=0}^{n-1} \frac{n}{2^i} + n & \text{Apply Lemma-2.} \\
& 2n\left(1-\frac{1}{2^n}\right) \leq E[T] \leq 6n\left(1-\frac{1}{2^n}\right) +n & \text{Use: } \sum_{i=0}^{n-1}\frac{1}{2^i} = \frac{1-1/2^n}{1-1/2}.\\
\Rightarrow &n \leq E[T] < 6n + 1 = 6n +1 & \text{Use: for } n\geq 1\,\,\,,\frac{1}{2} \leq \left(1-\frac{1}{2^n}\right) < 1 \\ &&\\ \Rightarrow & E[T] = \Theta(n)\,\,\Box & \\ \end{array} $$

We now return the question of the expected number of steps for the problem size to reach unity (i.e. expected number of iterations of the outer most while loop). This is also referred as the expected number of recursive calls in Motwani and Raghavan's book. However the following analysis is much simpler than the application of a calculus based lemma based on random walks in the book. Let $Y_{i}$ be a random variable corresponding to the number of iterations starting with a problem size of $X_{i} > 1$.
Y_{i} &:=&\text{random variable for the number of steps with problem size }\,\, X_{i} >1 \\
Y_i &=& 1 + Y_{i+1} \,\,\,\,\,\, \text{(Recursive formulation)}\\
Y = Y_0 &=& \underbrace{1 + 1 + 1 \dots + 1}_{k \text{-times}} + Y_{k} \,\,\,\,\,\, \text{(Expand } k \text{ times)}\\
&& \\

Theorem-2: Expected number of steps for the problem size to reach unity, $E[Y] = \Theta(\log(n))$.

&Y_i \leq X_{i}\,\,\,\, \text{ (Number of steps are at most problem size at } i^{th} \text{ step which is } X_i)\\
& \\
\Rightarrow &E[Y_i] \leq E[X_i] \,\,\,\, \text{( Let } Z = X_i-Y_i\,\,\,\,,\,\,\,\, E[Z] = E[X_i]-E[Y_i] \geq 0 \text{)}\\
&E[Y_i] \leq E[X_i] = \frac{n}{2^i} \,\,\,\, \text{ (Apply Lemma-2)} \\
&Y = Y_0 = \underbrace{1 + 1 + 1 \dots + 1}_{t \text{-times}} + Y_{t} \\
\Rightarrow & E[Y] = \underbrace{E[1] + E[1] \dots + E[1]}_{t \text{-times}} + E[Y_{t}] = t + E[Y_{t}] \\
&E[Y] = t + E[Y_{t}] \leq t + E[X_{t}] \leq t + \frac{n}{2^t} \\
\Rightarrow &t \leq E[Y] \leq t + \frac{n}{2^t} \,\,\,\, \text{After } t \text{ steps }\\
& \\
&\text{When } t = \log_2(n) \text{ the above inequality becomes the following } \\
& \log_2(n) \leq E[Y] \leq \log_2(n) + \frac{n}{2^{log_2(n)}} = \log_2(n) +1 \\

\Rightarrow & E[Y] =\Theta(log(n)) \,\, \Box \\

This completes our expected analysis of the RandSelect algorithm. The analysis just uses elementary probability theory without relying on a Calculus based random walk theorem in the "Randomized Algorithms" book by Motwani and Raghavan -- (Theorem-1.3 on page-15).

Thursday, November 24, 2016

Integer Division Algorithm with no Division or Multiplication Operators.

I was working through a number theory problem today and I needed an efficient algorithm to find a largest integer $n$ s.t $a\times n \leq b$ for any given integers $a, b$ (w.l.o.g $a < b$). Clearly $n$ is the quotient when $b$ is divided by $a$. Following is an algorithm which can compute the quotient and reminder in $\Theta(\log(a) + \log(b))$ operations without using multiplication or division operators. In fact if the number of bits in $a$ and $b$ are known then the algorithm runs in $\Theta(\log(a) - \log(b))$ operations.

typedef unsigned int uint;

// Integer division of b with a without using division
// or multiplication operators.
div_t integer_div(uint a, uint b) {
  div_t result;
  uint bit_diff = bit_count(a) - bit_count(b);
  uint prod = 0, prod_shift = b << bit_diff, shift = 1 << bit_diff;

  result.quot = 0;
  do {
    if (prod + prod_shift <= a) {
      prod += prod_shift;
      result.quot += shift;
    shift >>= 1;
    prod_shift >>= 1;
  } while (shift);
  result.rem = a - prod;

  return result;

// utility function to find number of bits
// required to represent the value of a.
uint bit_count(uint a) {
  unsigned int i = 0;
  do {
  } while ((a = a >> 1));
  return i;

The above algorithm was tested on $4294967293$ pairs of $(a,b), a,b \in [1, INT\_MAX]$ and compared with std::div -- running all these combinations just take around $2$ minutes on a mac book pro. It makes sense because that $\log(a) \leq 32$ (word size on the machine) hence we were just running around four billion constant operations. Typically in all algorithm analysis the word size is considered a $O(1)$, so as long as the inputs bit-widths are within the word size of the machine the runtime of the algorithm can be considered $O(1)$.

Notice that the algorithm is greedy in nature and will provide a correctness next.

Let $c=\sum_{i=0}^{\log(a)-\log(b)}t_i\times 2^{i}$ and $t_{\log(a)-\log(b)}t_{\log(a)-\log(b)-1}\ldots t_0$ the bit representation of $c$. The algorithm tries to pick the powers of $2$ in a greedy manner (i.e. $2^{\log(a)-\log(b)}\geq 2^{\log(a)-\log(b)-1} \geq \ldots 2^{0}$). Consider the choice of $i^{th}$ bit of $c$, if $2^{i}\times a \leq b$ then any $i$-bit integer with a $0$ in the $i^{th}$ bit is strictly less than $2^{i}$. Since $2^{i-1} + 2^{i-2} + \ldots 2^{0} = \frac{2^{i-1+1} -1}{2-1} = 2^{i}-1 < 2^{i}$. We can use induction of $i$ to complete the proof. Hence we can claim that if we greedily pick the powers of $2$ (i.e ${2^i\,\,| \log(a)-\log(b)\leq i \leq 0 }$) it will yield the largest integer $c$ s.t $a\times c \leq b$. The integer division algorithm is just corollary of this basic fact.

Saturday, June 11, 2016

What kind of strings have Kolmogorov complexity equal to their length ?

I have been doing some reading about Information Theory and Kolmogorov complexity lately. I just thought about the following problem. Consider a language of all C programs (strings) which print themselves, formally Let $$L_{self} = \{ w\,\, | w\,\, \mbox{is a C program which prints itself} \}$$.
Following is an example of a C program which prints itself (on a side-note the trick behind the construction is to use the ASCII values for quotes and newlines). Clearly the Kolmogorov complexity of each w is less than or equal to the length of w -- given that we have constructed a program of length same as the string w.
I'm wondering if this is indeed optimal ? -- that is given a w, if its possible to produce a program to generate w and size strictly less than |w|.

char qt=34;
char nl=10;
char *str="char qt=34;%cchar nl=10;%cchar *str=%c%s%c;%cint main() {%c printf(str,nl,nl,qt,str,qt,nl,nl,nl,nl);%c}%c";
int main() {

BTW, compile the program with
gcc -w 
to suppress warnings about header inclusion.

Saturday, March 07, 2015

Some Special Cases Of the Sequence Assembly Problem.

Sequence Assembly (SA) is a well studied problem in Bioinformatics. To quickly summarize the problem: let $\Sigma$ be some alphabet and $S =\{ s_1 , s_2 \ldots s_n\}$ be a set of strings from $\Sigma$. The goal of the SA problem is re-construct a string $s_a$ such that every $s_i \in S$ is a sub-string in $s_a$. Although its very tempting to reduce the SA problem to the Shortest Super String Problem (which happens to be $\mathcal{NP}$-hard), the repeats in genome makes the SA problem complex and ambiguous.

Any way one of my friends showed me this problem yesterday, which is a much simpler special case of the SA problem. Let $s_{\pi}$ be a some permutation (unknown) of $\Sigma$. Now given a set $S' = \{s'_1,s'_2\ldots s'_m\}$ of sub-sequences of $s_{\pi}$, we want to re-construct this un-known $s_{\pi}$. Just like the SA problem for a given $S'$ there could be several answers for $s_{\pi}$ -- in which case we want to find out the lexicographically smallest permutation. If $\Sigma_i |s'_i| = N$ the problem could be solved efficiently by reducing it to a topological sorting algorithm on a DAG. Following is a quick solution to this problem I wrote while watching the RSA vs PAK world cup cricket match -- I'm little disappointed that RSA lost despite a ferocious knock by AB. The construction of the DAG from $S'$ takes $O(Nlog(N)$ operations, however the cost of topological sorting to report the answer is still $O(N)$ operations.

// 03/06/2015: vamsik (at) engr.uconn.edu//
#include <cmath>
#include <cstdio>
#include <map>
#include <list>
#include <unordered_map>
#include <iostream>
#include <algorithm>
#include <cassert>
using namespace std;

struct CDirectedGraph{
    map<unsigned int, map<unsigned int, bool> > m_aEdges;
    unordered_map<unsigned int, unsigned int> m_aInDegree;
    inline CDirectedGraph() : m_aEdges(), m_aInDegree() {}
    inline void add_edge(unsigned int i, unsigned int j){
        if((m_aEdges[i]).find(j) == (m_aEdges[i]).end()){
            if(m_aInDegree.find(j) == m_aInDegree.end()){
                m_aInDegree[j] = 0;
            (m_aEdges[i])[j] = true;
    inline void top_sort(){
        map<unsigned int, bool> snodes;
        unsigned int v;
        for(auto vItr=m_aEdges.begin(); vItr!=m_aEdges.end(); ++vItr){
                snodes[vItr->first] = true;
            auto vItr = snodes.begin();
            v = vItr->first;
            auto eEnd = m_aEdges[v].end();
            auto eBeg = m_aEdges[v].begin();
            for(auto eItr=eBeg;  eItr!=eEnd; ++eItr){
                    snodes[eItr->first] = true;
            printf("%u ", v);

int main() {
    unsigned int N;
    unsigned int K, pnode, cnode;
    scanf("%u", &N);
    CDirectedGraph diGraph;
    for(unsigned int i=0; i<N; i++){
        scanf("%u", &K);
        scanf("%u", &pnode);
        for(unsigned int j=0; j<K-1; j++){
            scanf("%u", &cnode);
            diGraph.add_edge(pnode, cnode);
            pnode = cnode;
    return 0;