Sequence Assembly (SA) is a well studied problem in Bioinformatics. To quickly summarize the problem: let $\Sigma$ be some alphabet and $S =\{ s_1 , s_2 \ldots s_n\}$ be a set of strings from $\Sigma$. The goal of the SA problem is re-construct a string $s_a$ such that every $s_i \in S$ is a sub-string in $s_a$. Although its very tempting to reduce the SA problem to the Shortest Super String Problem (which happens to be $\mathcal{NP}$-hard), the repeats in genome makes the SA problem complex and ambiguous.
Any way one of my friends showed me this problem yesterday, which is a much simpler special case of the SA problem. Let $s_{\pi}$ be a some permutation (unknown) of $\Sigma$. Now given a set $S' = \{s'_1,s'_2\ldots s'_m\}$ of sub-sequences of $s_{\pi}$, we want to re-construct this un-known $s_{\pi}$. Just like the SA problem for a given $S'$ there could be several answers for $s_{\pi}$ -- in which case we want to find out the lexicographically smallest permutation. If $\Sigma_i |s'_i| = N$ the problem could be solved efficiently by reducing it to a topological sorting algorithm on a DAG. Following is a quick solution to this problem I wrote while watching the RSA vs PAK world cup cricket match -- I'm little disappointed that RSA lost despite a ferocious knock by AB. The construction of the DAG from $S'$ takes $O(Nlog(N)$ operations, however the cost of topological sorting to report the answer is still $O(N)$ operations.
// 03/06/2015: vamsik (at) engr.uconn.edu// #include <cmath> #include <cstdio> #include <map> #include <list> #include <unordered_map> #include <iostream> #include <algorithm> #include <cassert> using namespace std; struct CDirectedGraph{ map<unsigned int, map<unsigned int, bool> > m_aEdges; unordered_map<unsigned int, unsigned int> m_aInDegree; inline CDirectedGraph() : m_aEdges(), m_aInDegree() {} inline void add_edge(unsigned int i, unsigned int j){ if((m_aEdges[i]).find(j) == (m_aEdges[i]).end()){ if(m_aInDegree.find(j) == m_aInDegree.end()){ m_aInDegree[j] = 0; } m_aInDegree[j]++; (m_aEdges[i])[j] = true; } } inline void top_sort(){ map<unsigned int, bool> snodes; unsigned int v; for(auto vItr=m_aEdges.begin(); vItr!=m_aEdges.end(); ++vItr){ if(!m_aInDegree[vItr->first]){ snodes[vItr->first] = true; } } while(!snodes.empty()){ auto vItr = snodes.begin(); v = vItr->first; auto eEnd = m_aEdges[v].end(); auto eBeg = m_aEdges[v].begin(); for(auto eItr=eBeg; eItr!=eEnd; ++eItr){ m_aInDegree[eItr->first]--; if(!m_aInDegree[eItr->first]){ snodes[eItr->first] = true; } } printf("%u ", v); snodes.erase(vItr); } } }; int main() { unsigned int N; unsigned int K, pnode, cnode; scanf("%u", &N); CDirectedGraph diGraph; for(unsigned int i=0; i<N; i++){ scanf("%u", &K); scanf("%u", &pnode); for(unsigned int j=0; j<K-1; j++){ scanf("%u", &cnode); diGraph.add_edge(pnode, cnode); pnode = cnode; } } diGraph.top_sort(); return 0; }
No comments:
Post a Comment