Sequence Assembly (SA) is a well studied problem in Bioinformatics. To quickly summarize the problem: let \Sigma be some alphabet and S =\{ s_1 , s_2 \ldots s_n\} be a set of strings from \Sigma. The goal of the SA problem is re-construct a string s_a such that every s_i \in S is a sub-string in s_a. Although its very tempting to reduce the SA problem to the Shortest Super String Problem (which happens to be \mathcal{NP}-hard), the repeats in genome makes the SA problem complex and ambiguous.
Any way one of my friends showed me this problem yesterday, which is a much simpler special case of the SA problem. Let s_{\pi} be a some permutation (unknown) of \Sigma. Now given a set S' = \{s'_1,s'_2\ldots s'_m\} of sub-sequences of s_{\pi}, we want to re-construct this un-known s_{\pi}. Just like the SA problem for a given S' there could be several answers for s_{\pi} -- in which case we want to find out the lexicographically smallest permutation. If \Sigma_i |s'_i| = N the problem could be solved efficiently by reducing it to a topological sorting algorithm on a DAG. Following is a quick solution to this problem I wrote while watching the RSA vs PAK world cup cricket match -- I'm little disappointed that RSA lost despite a ferocious knock by AB. The construction of the DAG from S' takes O(Nlog(N) operations, however the cost of topological sorting to report the answer is still O(N) operations.
// 03/06/2015: vamsik (at) engr.uconn.edu// #include <cmath> #include <cstdio> #include <map> #include <list> #include <unordered_map> #include <iostream> #include <algorithm> #include <cassert> using namespace std; struct CDirectedGraph{ map<unsigned int, map<unsigned int, bool> > m_aEdges; unordered_map<unsigned int, unsigned int> m_aInDegree; inline CDirectedGraph() : m_aEdges(), m_aInDegree() {} inline void add_edge(unsigned int i, unsigned int j){ if((m_aEdges[i]).find(j) == (m_aEdges[i]).end()){ if(m_aInDegree.find(j) == m_aInDegree.end()){ m_aInDegree[j] = 0; } m_aInDegree[j]++; (m_aEdges[i])[j] = true; } } inline void top_sort(){ map<unsigned int, bool> snodes; unsigned int v; for(auto vItr=m_aEdges.begin(); vItr!=m_aEdges.end(); ++vItr){ if(!m_aInDegree[vItr->first]){ snodes[vItr->first] = true; } } while(!snodes.empty()){ auto vItr = snodes.begin(); v = vItr->first; auto eEnd = m_aEdges[v].end(); auto eBeg = m_aEdges[v].begin(); for(auto eItr=eBeg; eItr!=eEnd; ++eItr){ m_aInDegree[eItr->first]--; if(!m_aInDegree[eItr->first]){ snodes[eItr->first] = true; } } printf("%u ", v); snodes.erase(vItr); } } }; int main() { unsigned int N; unsigned int K, pnode, cnode; scanf("%u", &N); CDirectedGraph diGraph; for(unsigned int i=0; i<N; i++){ scanf("%u", &K); scanf("%u", &pnode); for(unsigned int j=0; j<K-1; j++){ scanf("%u", &cnode); diGraph.add_edge(pnode, cnode); pnode = cnode; } } diGraph.top_sort(); return 0; }