## Saturday, March 07, 2015

### Some Special Cases Of the Sequence Assembly Problem.

Sequence Assembly (SA) is a well studied problem in Bioinformatics. To quickly summarize the problem: let $\Sigma$ be some alphabet and $S =\{ s_1 , s_2 \ldots s_n\}$ be a set of strings from $\Sigma$. The goal of the SA problem is re-construct a string $s_a$ such that every $s_i \in S$ is a sub-string in $s_a$. Although its very tempting to reduce the SA problem to the Shortest Super String Problem (which happens to be $\mathcal{NP}$-hard), the repeats in genome makes the SA problem complex and ambiguous.

Any way one of my friends showed me this problem yesterday, which is a much simpler special case of the SA problem. Let $s_{\pi}$ be a some permutation (unknown) of $\Sigma$. Now given a set $S' = \{s'_1,s'_2\ldots s'_m\}$ of sub-sequences of $s_{\pi}$, we want to re-construct this un-known $s_{\pi}$. Just like the SA problem for a given $S'$ there could be several answers for $s_{\pi}$ -- in which case we want to find out the lexicographically smallest permutation. If $\Sigma_i |s'_i| = N$ the problem could be solved efficiently by reducing it to a topological sorting algorithm on a DAG. Following is a quick solution to this problem I wrote while watching the RSA vs PAK world cup cricket match -- I'm little disappointed that RSA lost despite a ferocious knock by AB. The construction of the DAG from $S'$ takes $O(Nlog(N)$ operations, however the cost of topological sorting to report the answer is still $O(N)$ operations.

// 03/06/2015: vamsik (at) engr.uconn.edu//
#include <cmath>
#include <cstdio>
#include <map>
#include <list>
#include <unordered_map>
#include <iostream>
#include <algorithm>
#include <cassert>
using namespace std;

struct CDirectedGraph{
map<unsigned int, map<unsigned int, bool> > m_aEdges;
unordered_map<unsigned int, unsigned int> m_aInDegree;

inline CDirectedGraph() : m_aEdges(), m_aInDegree() {}

inline void add_edge(unsigned int i, unsigned int j){
if((m_aEdges[i]).find(j) == (m_aEdges[i]).end()){
if(m_aInDegree.find(j) == m_aInDegree.end()){
m_aInDegree[j] = 0;
}
m_aInDegree[j]++;
(m_aEdges[i])[j] = true;
}
}

inline void top_sort(){
map<unsigned int, bool> snodes;
unsigned int v;
for(auto vItr=m_aEdges.begin(); vItr!=m_aEdges.end(); ++vItr){
if(!m_aInDegree[vItr->first]){
snodes[vItr->first] = true;
}
}

while(!snodes.empty()){
auto vItr = snodes.begin();
v = vItr->first;
auto eEnd = m_aEdges[v].end();
auto eBeg = m_aEdges[v].begin();

for(auto eItr=eBeg;  eItr!=eEnd; ++eItr){
m_aInDegree[eItr->first]--;
if(!m_aInDegree[eItr->first]){
snodes[eItr->first] = true;
}
}
printf("%u ", v);
snodes.erase(vItr);
}
}
};

int main() {
unsigned int N;
unsigned int K, pnode, cnode;
scanf("%u", &N);
CDirectedGraph diGraph;

for(unsigned int i=0; i<N; i++){
scanf("%u", &K);
scanf("%u", &pnode);
for(unsigned int j=0; j<K-1; j++){
scanf("%u", &cnode);