Saturday, March 07, 2015

Some Special Cases Of the Sequence Assembly Problem.

Sequence Assembly (SA) is a well studied problem in Bioinformatics. To quickly summarize the problem: let $\Sigma$ be some alphabet and $S =\{ s_1 , s_2 \ldots s_n\}$ be a set of strings from $\Sigma$. The goal of the SA problem is re-construct a string $s_a$ such that every $s_i \in S$ is a sub-string in $s_a$. Although its very tempting to reduce the SA problem to the Shortest Super String Problem (which happens to be $\mathcal{NP}$-hard), the repeats in genome makes the SA problem complex and ambiguous.

Any way one of my friends showed me this problem yesterday, which is a much simpler special case of the SA problem. Let $s_{\pi}$ be a some permutation (unknown) of $\Sigma$. Now given a set $S' = \{s'_1,s'_2\ldots s'_m\}$ of sub-sequences of $s_{\pi}$, we want to re-construct this un-known $s_{\pi}$. Just like the SA problem for a given $S'$ there could be several answers for $s_{\pi}$ -- in which case we want to find out the lexicographically smallest permutation. If $\Sigma_i |s'_i| = N$ the problem could be solved efficiently by reducing it to a topological sorting algorithm on a DAG. Following is a quick solution to this problem I wrote while watching the RSA vs PAK world cup cricket match -- I'm little disappointed that RSA lost despite a ferocious knock by AB. The construction of the DAG from $S'$ takes $O(Nlog(N)$ operations, however the cost of topological sorting to report the answer is still $O(N)$ operations.

// 03/06/2015: vamsik (at) engr.uconn.edu//
#include <cmath>
#include <cstdio>
#include <map>
#include <list>
#include <unordered_map>
#include <iostream>
#include <algorithm>
#include <cassert>
using namespace std;


struct CDirectedGraph{
    map<unsigned int, map<unsigned int, bool> > m_aEdges;
    unordered_map<unsigned int, unsigned int> m_aInDegree;
    
    
    inline CDirectedGraph() : m_aEdges(), m_aInDegree() {}
    
    inline void add_edge(unsigned int i, unsigned int j){
        if((m_aEdges[i]).find(j) == (m_aEdges[i]).end()){
            if(m_aInDegree.find(j) == m_aInDegree.end()){
                m_aInDegree[j] = 0;
            }
            m_aInDegree[j]++;
            (m_aEdges[i])[j] = true;
        }
    }
    
    inline void top_sort(){
        map<unsigned int, bool> snodes;
        unsigned int v;
        for(auto vItr=m_aEdges.begin(); vItr!=m_aEdges.end(); ++vItr){
            if(!m_aInDegree[vItr->first]){
                snodes[vItr->first] = true;
            }
        }
        
        while(!snodes.empty()){
            auto vItr = snodes.begin();
            v = vItr->first;
            auto eEnd = m_aEdges[v].end();
            auto eBeg = m_aEdges[v].begin();
            
            for(auto eItr=eBeg;  eItr!=eEnd; ++eItr){
                m_aInDegree[eItr->first]--;
                if(!m_aInDegree[eItr->first]){
                    snodes[eItr->first] = true;
                }
            }
            printf("%u ", v);
            snodes.erase(vItr);
        }
    }
};

int main() {
    unsigned int N;
    unsigned int K, pnode, cnode;
    scanf("%u", &N);
    CDirectedGraph diGraph;
    
    for(unsigned int i=0; i<N; i++){
        scanf("%u", &K);
        scanf("%u", &pnode);
        for(unsigned int j=0; j<K-1; j++){
            scanf("%u", &cnode);
            diGraph.add_edge(pnode, cnode);
            pnode = cnode;
        }
    }
    diGraph.top_sort();
    return 0;
}