Thursday, December 24, 2009

[TECH] An interesting problem on the mailing list []

Some one asked the following question of GDB mailing list.

  I was wondering how come the call to "call cos(.2)" returns the wrong
answer in:

#include "math.h"
double    (*fcn)(double);
  double t;
  fcn = cos;

(gdb) call cos(.2)
$16 = 104
(gdb) call fcn(.2)
$17 = 0.98006657784124163

View this message in context: 
Sent from the Gnu - gdb - General mailing list archive at
Here is my analysis.

Looks like 'gdb' is missing debug information about the function 'cos' , 
and the debugger implicitly assumes all the functions return an 'int' ,
 if the function is not compiled in debug mode.

I derive my explanation from the following test case.

1. Create a function 'double test_cos(double)' and compile it in optimized mode with -O3 using gcc
 "$gcc -c -O3 test1.c"

2. Complie the file containing the 'main' function in debug mode
"$gcc -g test.c test1.o -lm"

============[gdb a.out]====================

(gdb) list 0
1       #include "math.h"
2       #include "test1.h"
3       double    (*fcn)(double);
4       main()
5       {
6                double t;
7                 fcn = cos;
8                  t=cos(.2);
9                   t=cos(-.2);
10                      test_cos(.2);
(gdb) call fcn(0.2)
$4 = 0.98006657784124163
(gdb) call cos(0.2)
$5 = -1073792992
(gdb) call test_cos(0.2)
$6 = -1073792992

#ifndef _test_1_h
#define _test_1_h
double test_cos(double);

#include "math.h"
#include "test1.h"
double test_cos(double a){
        return cos(a);

#include "math.h"
#include "test1.h"

double    (*fcn)(double);
         double t;
          fcn = cos;

May be the developers of gdb can you more details.

(gdb) call cos
$6 = {text variable, no debug info} 0xc325d0 {cos}

Wednesday, December 23, 2009

[TECH] Manipulating DNA sequences with bits

The alphabet of DNA sequences consists of 4 alphabets {A,T,G,C} and they exists in compliments -- A-T, G-C. Often in sequence assembly algorithms we need to squeeze these into bits for saving space since we have too many of them. For example a typical input to a sequence assembler consists of at least 10 million reads (string of fixed length, say 21). In my last post I briefly described the Sequence Assembly (SA) problem. As I pointed one of the basic operations on these reads would be to obtain a reverse compliment of a given read. For example our read is ATTACCAG , then its reverse compliment would be CTGGTAAT -- reverse the original sequence replace every symbol by its compliment (A-T, G-C). The following routines may be useful when we squeeze these symbols into bits using only 2bits per symbol. While using the bit-wise operators I realized that == operator has more precedence than & (bit-wise or). So if you a&b==b then that will always be true, so we need to be more careful use (a&b)==b.
#define MASK_A 3
#define MASK_T 0
#define MASK_G 1
#define MASK_C 2
/*Lets fix the maximum size of the read as 32 base pairs*/
typedef unsigned long long kmer_t;

/*Takes a read of length K and returns its bit represenation*/
kmer_t CreateBKmer(char *kmer, unsigned long K){
    unsigned long i;
    unsigned char bitc;
    /*A binary k-mer*/
    kmer_t bkmer = (kmer_t) 0; 
    for(i=0; i<K; i++){
        bkmer <<= 2; 
        bkmer |= Char2Bit(kmer[i]);
    return bkmer;

/*Return the reverse compliment of the kmer_a*/
kmer_t ReverseCompliment(kmer_t a, unsigned int K){
    kmer_t rc=0;
    unsigned int i=0;
    for(i=0; i<K; i++){
        rc <<=2;
        rc |= (3-(a&3));
        a >>=2;
    return rc;
/*Print the ASCII equivalent of the underlying bits*/
void PrintKmer(kmer_t a, unsigned int K, FILE *ptr){
    char ntide;
    kmer_t mask_a = MASK_A;
    kmer_t mask_t = MASK_T;
    kmer_t mask_g = MASK_G;
    kmer_t mask_c = MASK_C;

    mask_a <<= 2*(K-1); mask_t <<= 2*(K-1); 
    mask_g <<= 2*(K-1); mask_c <<= 2*(K-1);

        if((a&mask_a) == mask_a){
            ntide = 'A';
        }else if((a&mask_c) == mask_c){
            ntide = 'C';
        }else if((a&mask_g) == mask_g){
            ntide = 'G';
        }else{ /*This should be the last because T=0*/
            ntide = 'T';
        fprintf(ptr, "%c", ntide);
        mask_a >>=2; mask_t >>=2;
        mask_g >>=2; mask_c >>=2;
    }while(mask_a); /*MASK_A = 3 */
    fprintf(ptr, "\n");

inline char Bit2Char(unsigned char a){
        case MASK_A:
            return 'A';
        case MASK_G: 
            return 'G'; 
        case MASK_C:
            return 'C';
        case MASK_T:
            return 'T';

inline unsigned char Char2Bit(char c){
        case 'A':
            return MASK_A;
        case 'T':
            return MASK_T; 
        case 'G':
            return MASK_G;
        case 'C':
            return MASK_C; 

Friday, December 18, 2009

[TECH] Loops in Bi-directed De Bruijn Graphs

De Bruijn graphs recently found applications in Sequence Assembly($ SA$). $ SA$ problem is close to Shortest Common super String($ SCS$) problem, however there are some fundamental differences. Given a set of input strings $ S=\{s_1,s_2,s_3\ldots s_n\}$ strings the $ SCS$ problem outputs a string $ s^*$ such that every $ s_i$ is a sub-string in $ s^*$ and $ \nexists s': length(s') < length(s^*)$. On the other hand the input to the $ SA$ problem is a set $ R=\{r_1,r_2,r_3\ldots r_n\}$ of strings, every $ r_i$ is a randomly chosen sub-string of a bigger string $ G$ - called the genome. Unfortunately we do not know what $ G$ is, so given $ R$ the $ SA$ problem asks to find the $ G$ such that we can explain every $ r_i$. Adding to the complexity of $ SA$ problem $ G$ may contain several repeating regions and the direct appliction of $ SCS$ to solve $ SA$ problem would result in collapsing the repeating regions in $ G$ - so we cannot directly reduce the $ SA$ problem to $ SCS$. Also the string $ G$ is special string coming from the $ DNA$ so its double stranded and each $ r_i$ may be originating from the forward or reverse compliments of $ G$. \includegraphics[scale=0.5]{loops_debrujin.eps}

A $ k-$De Bruijn graph $ B^k(R)$ on a set of strings $ R$ is a graph whose vertices correspond to the unique $ k-$mers (a string of length $ k$) from all the $ r_i \in R$ and whose edges correspond to a some $ k+1-$mer in some $ r_j \in R$. Also to consider the double stranded-ness we also include a reverse compliment of every $ k-$mer into the graph $ B^k(R)$. In the following simple example I show how a loop (i.e. both the forward and reverse compliments overlapping each other by $ k-1$) can originate into $ B^k(R)$. This interesting situation seems to be referred to as hairpin-loop.

Saturday, December 12, 2009

[TECH] Simplified Proof For The Application Of Freivalds' Technique to Verify Matrix Multiplication

We first give a simple and alternative proof for Theorem-$ 7.2$ in [Randomized Algorithms, Motwani R. and Raghavan P., pp 162--163]. Later in Theorem 2 we show that the assumption on the uniformness is not necessary.
Let $A$,$B$\ and $C$\ be three $n\times n$\ matrices such that $...
...tribution. Then $P[AB\vec{r} = C\vec{r} \vert AB \neq C] \leq 1/2$

Proof. Let $ X$ be a $ n\times n$ matrix and $ \vec{x_1}, \vec{x_2}\ldots \vec{x_n}$ be the column vectors of $ X$. Then $ X\vec{r} = \sum_{i=1}^{n}r_i\vec{x_1}$. This means that multiplying a vector with a matrix is linear combination of the columns, the coefficient $ r_i$ is the $ i^{th}$ component of $ \vec{r}$. Since $ \vec{r}$ is a boolean and $ r_i$ acts as an indicator variable on the selection of column $ \vec{x_i}$. So if $ \vec{r}$ is chosen from a uniform distribution $ P[r_i=0] = P[r_i=1] = 1/2$.

Now let $ D=AB$ and $ \vec{d_1},\vec{d_2}\ldots \vec{d_n}$ be the column vectors of $ D$, similarly let $ \vec{c_1},\vec{c_2}\ldots \vec{c_n}$ be the column vectors of $ C$. Let $ Y =\{\vec{d_j} \vert \vec{d_j}\neq \vec{c_j}$, clearly $ \vert Y\vert \geq 1$ since $ C\neq D$. Then $ P[AB\vec{r} = C\vec{r} \vert AB \neq C] = \displaystyle\Pi_{\vec{d_i}\notin Y}P[r_i] = (1/2)^{n-\vert Y\vert} \leq 1/2$ since $ 1 \leq \vert Y\vert\leq n-1$. Intuitively this means we select our random vector $ \vec{r}$ such that $ r_i=0$ for all $ d_i \in Y$, such a selection will always ensure $ AB\vec{r} = C\vec{r}$ even though $ AB\neq C$. $ \qedsymbol$

Let $A$,$B$\ and $C$\ be three $n \times n$\ matrices. Let $\vec...
...$f(r)$\ is an arbitrary probability density/distribution

Proof. Continuing with the proof of Theorem- 1 , $ P[AB\vec{r} = C\vec{r}\vert AB\neq C] = \displaystyle\Pi_{\vec{d_i}\notin Y} P[r=r_i] \leq f(r)$. $ \qedsymbol$

There always exists an $\Theta(n^2)$\ time Monte Carlo algorit...
...creasing error probability, for the
problem to check if $AB=C$.