De Bruijn graphs recently found applications in
Sequence Assembly(
data:image/s3,"s3://crabby-images/16581/16581b23c47dbfc1a43cdd9d6bb6798878044a74" alt="$ SA$"
).
data:image/s3,"s3://crabby-images/16581/16581b23c47dbfc1a43cdd9d6bb6798878044a74" alt="$ SA$"
problem is close to
Shortest Common super String(
data:image/s3,"s3://crabby-images/987f1/987f1481afc84e366846a92537c32851ba01cf1f" alt="$ SCS$"
) problem, however
there are some fundamental differences. Given a set of input strings
data:image/s3,"s3://crabby-images/0aa2b/0aa2b0b0f2048255d2f8a8ae10a504856628cdc0" alt="$ S=\{s_1,s_2,s_3\ldots s_n\}$"
strings the
data:image/s3,"s3://crabby-images/987f1/987f1481afc84e366846a92537c32851ba01cf1f" alt="$ SCS$"
problem outputs a string
data:image/s3,"s3://crabby-images/c23a6/c23a684d9c37ad6cab92adc665e4477760d992f7" alt="$ s^*$"
such that every
data:image/s3,"s3://crabby-images/03fb2/03fb237b043ed9422113ba40f5e3578560743d9f" alt="$ s_i$"
is a sub-string in
data:image/s3,"s3://crabby-images/c23a6/c23a684d9c37ad6cab92adc665e4477760d992f7" alt="$ s^*$"
and
data:image/s3,"s3://crabby-images/7b21f/7b21f96a8b4918d3b42ded7c22f4fc3eab1ab9ac" alt="$ \nexists s': length(s') < length(s^*)$"
. On the other hand the input to the
data:image/s3,"s3://crabby-images/16581/16581b23c47dbfc1a43cdd9d6bb6798878044a74" alt="$ SA$"
problem
is a set
data:image/s3,"s3://crabby-images/0533d/0533d42da0a1d914a33a308239dc24de518fad93" alt="$ R=\{r_1,r_2,r_3\ldots r_n\}$"
of strings, every
data:image/s3,"s3://crabby-images/e88c9/e88c9be50b74b9471da22e098d265f1ff908c5e2" alt="$ r_i$"
is a randomly chosen
sub-string of a bigger string
data:image/s3,"s3://crabby-images/7ec78/7ec7864fdae418d4b8e27f1e327903e38b742ab7" alt="$ G$"
- called the
genome. Unfortunately we do not
know what
data:image/s3,"s3://crabby-images/7ec78/7ec7864fdae418d4b8e27f1e327903e38b742ab7" alt="$ G$"
is, so given
data:image/s3,"s3://crabby-images/3752e/3752e0d15fe4d43ad017cd44065c459e84430fe2" alt="$ R$"
the
data:image/s3,"s3://crabby-images/16581/16581b23c47dbfc1a43cdd9d6bb6798878044a74" alt="$ SA$"
problem asks to find the
data:image/s3,"s3://crabby-images/7ec78/7ec7864fdae418d4b8e27f1e327903e38b742ab7" alt="$ G$"
such that we can
explain every
data:image/s3,"s3://crabby-images/e88c9/e88c9be50b74b9471da22e098d265f1ff908c5e2" alt="$ r_i$"
. Adding to the complexity of
data:image/s3,"s3://crabby-images/16581/16581b23c47dbfc1a43cdd9d6bb6798878044a74" alt="$ SA$"
problem
data:image/s3,"s3://crabby-images/7ec78/7ec7864fdae418d4b8e27f1e327903e38b742ab7" alt="$ G$"
may contain several
repeating regions and the direct appliction of
data:image/s3,"s3://crabby-images/987f1/987f1481afc84e366846a92537c32851ba01cf1f" alt="$ SCS$"
to solve
data:image/s3,"s3://crabby-images/16581/16581b23c47dbfc1a43cdd9d6bb6798878044a74" alt="$ SA$"
problem would
result in collapsing the repeating regions in
data:image/s3,"s3://crabby-images/7ec78/7ec7864fdae418d4b8e27f1e327903e38b742ab7" alt="$ G$"
- so we cannot directly reduce
the
data:image/s3,"s3://crabby-images/16581/16581b23c47dbfc1a43cdd9d6bb6798878044a74" alt="$ SA$"
problem to
data:image/s3,"s3://crabby-images/987f1/987f1481afc84e366846a92537c32851ba01cf1f" alt="$ SCS$"
. Also the string
data:image/s3,"s3://crabby-images/7ec78/7ec7864fdae418d4b8e27f1e327903e38b742ab7" alt="$ G$"
is special string coming from the
data:image/s3,"s3://crabby-images/e8541/e854139961ac0b23fc34eabe5062cfa26de4846a" alt="$ DNA$"
so its double stranded and each
data:image/s3,"s3://crabby-images/e88c9/e88c9be50b74b9471da22e098d265f1ff908c5e2" alt="$ r_i$"
may be originating from the forward or
reverse compliments of
data:image/s3,"s3://crabby-images/7ec78/7ec7864fdae418d4b8e27f1e327903e38b742ab7" alt="$ G$"
.
A
De Bruijn graph
on a set of strings
is a graph whose vertices
correspond to the unique
mers (a string of length
) from all the
and whose edges correspond to a some
mer in some
. Also to consider
the double stranded-ness we also include a reverse compliment of every
mer into
the graph
. In the following simple example I show how a loop (i.e. both
the forward and reverse compliments overlapping each other by
) can originate
into
. This interesting situation seems to be referred to as hairpin-loop.