## Wednesday, December 23, 2009

### [TECH] Manipulating DNA sequences with bits

The alphabet of DNA sequences consists of 4 alphabets {A,T,G,C} and they exists in compliments -- A-T, G-C. Often in sequence assembly algorithms we need to squeeze these into bits for saving space since we have too many of them. For example a typical input to a sequence assembler consists of at least 10 million reads (string of fixed length, say 21). In my last post I briefly described the Sequence Assembly (SA) problem. As I pointed one of the basic operations on these reads would be to obtain a reverse compliment of a given read. For example our read is ATTACCAG , then its reverse compliment would be CTGGTAAT -- reverse the original sequence replace every symbol by its compliment (A-T, G-C). The following routines may be useful when we squeeze these symbols into bits using only 2bits per symbol. While using the bit-wise operators I realized that == operator has more precedence than & (bit-wise or). So if you a&b==b then that will always be true, so we need to be more careful use (a&b)==b.
#define MASK_A 3
#define MASK_T 0
#define MASK_G 1
#define MASK_C 2
/*Lets fix the maximum size of the read as 32 base pairs*/
typedef unsigned long long kmer_t;

/*Takes a read of length K and returns its bit represenation*/
kmer_t CreateBKmer(char *kmer, unsigned long K){
unsigned long i;
unsigned char bitc;
/*A binary k-mer*/
kmer_t bkmer = (kmer_t) 0;
for(i=0; i<K; i++){
bkmer <<= 2;
bkmer |= Char2Bit(kmer[i]);
}
return bkmer;
}

/*Return the reverse compliment of the kmer_a*/
kmer_t ReverseCompliment(kmer_t a, unsigned int K){
kmer_t rc=0;
unsigned int i=0;
for(i=0; i<K; i++){
rc <<=2;
rc |= (3-(a&3));
a >>=2;
}
return rc;
}
/*Print the ASCII equivalent of the underlying bits*/
void PrintKmer(kmer_t a, unsigned int K, FILE *ptr){
char ntide;
kmer_t mask_a = MASK_A;
kmer_t mask_t = MASK_T;
kmer_t mask_g = MASK_G;
kmer_t mask_c = MASK_C;

mask_a <<= 2*(K-1); mask_t <<= 2*(K-1);
mask_g <<= 2*(K-1); mask_c <<= 2*(K-1);

do{
if((a&mask_a) == mask_a){
ntide = 'A';
}else if((a&mask_c) == mask_c){
ntide = 'C';
}else if((a&mask_g) == mask_g){
ntide = 'G';
}else{ /*This should be the last because T=0*/
ntide = 'T';
}
fprintf(ptr, "%c", ntide);
mask_a >>=2; mask_t >>=2;
mask_g >>=2; mask_c >>=2;
}while(mask_a); /*MASK_A = 3 */
fprintf(ptr, "\n");
}

inline char Bit2Char(unsigned char a){
switch(a){
case MASK_A:
return 'A';
case MASK_G:
return 'G';
case MASK_C:
return 'C';
case MASK_T:
return 'T';
}
}

inline unsigned char Char2Bit(char c){
switch(c){
case 'A':
return MASK_A;
case 'T':
return MASK_T;
case 'G':
return MASK_G;
case 'C':
return MASK_C;
}
}

#### 1 comment:

Balanagireddy said...

Hi Vamsi,

Nenu balanagireddy ni. Stonybrook lo vunnanu ippudu..Ikkada MS chestunna..

I have question in your code, I didn't understand the purpose of this function : Bit2Char.

Can you please elaborate.

-- Bala