De Bruijn graphs recently found applications in
Sequence Assembly(
).
problem is close to
Shortest Common super String(
) problem, however
there are some fundamental differences. Given a set of input strings
strings the
problem outputs a string
such that every
is a sub-string in
and
. On the other hand the input to the
problem
is a set
of strings, every
is a randomly chosen
sub-string of a bigger string
- called the
genome. Unfortunately we do not
know what
is, so given
the
problem asks to find the
such that we can
explain every
. Adding to the complexity of
problem
may contain several
repeating regions and the direct appliction of
to solve
problem would
result in collapsing the repeating regions in
- so we cannot directly reduce
the
problem to
. Also the string
is special string coming from the
so its double stranded and each
may be originating from the forward or
reverse compliments of
.
A De Bruijn graph on a set of strings is a graph whose vertices
correspond to the unique mers (a string of length ) from all the
and whose edges correspond to a some mer in some . Also to consider
the double stranded-ness we also include a reverse compliment of every mer into
the graph . In the following simple example I show how a loop (i.e. both
the forward and reverse compliments overlapping each other by ) can originate
into . This interesting situation seems to be referred to as hairpin-loop.