## Wednesday, December 23, 2009

### [TECH] Manipulating DNA sequences with bits

The alphabet of DNA sequences consists of 4 alphabets {A,T,G,C} and they exists in compliments -- A-T, G-C. Often in sequence assembly algorithms we need to squeeze these into bits for saving space since we have too many of them. For example a typical input to a sequence assembler consists of at least 10 million reads (string of fixed length, say 21). In my last post I briefly described the Sequence Assembly (SA) problem. As I pointed one of the basic operations on these reads would be to obtain a reverse compliment of a given read. For example our read is ATTACCAG , then its reverse compliment would be CTGGTAAT -- reverse the original sequence replace every symbol by its compliment (A-T, G-C). The following routines may be useful when we squeeze these symbols into bits using only 2bits per symbol. While using the bit-wise operators I realized that == operator has more precedence than & (bit-wise or). So if you a&b==b then that will always be true, so we need to be more careful use (a&b)==b.
#define MASK_A 3
/*Lets fix the maximum size of the read as 32 base pairs*/
typedef unsigned long long kmer_t;

/*Takes a read of length K and returns its bit represenation*/
kmer_t CreateBKmer(char *kmer, unsigned long K){
unsigned long i;
unsigned char bitc;
/*A binary k-mer*/
kmer_t bkmer = (kmer_t) 0;
for(i=0; i<K; i++){
bkmer <<= 2;
bkmer |= Char2Bit(kmer[i]);
}
return bkmer;
}

/*Return the reverse compliment of the kmer_a*/
kmer_t ReverseCompliment(kmer_t a, unsigned int K){
kmer_t rc=0;
unsigned int i=0;
for(i=0; i<K; i++){
rc <<=2;
rc |= (3-(a&3));
a >>=2;
}
return rc;
}
/*Print the ASCII equivalent of the underlying bits*/
void PrintKmer(kmer_t a, unsigned int K, FILE *ptr){
char ntide;

do{
ntide = 'A';
ntide = 'C';
ntide = 'G';
}else{ /*This should be the last because T=0*/
ntide = 'T';
}
fprintf(ptr, "%c", ntide);
fprintf(ptr, "\n");
}

inline char Bit2Char(unsigned char a){
switch(a){
return 'A';
return 'G';
return 'C';
return 'T';
}
}

inline unsigned char Char2Bit(char c){
switch(c){
case 'A':
case 'T':
case 'G':
case 'C':
}
}

#### 1 comment:

Balanagireddy said...

Hi Vamsi,

Nenu balanagireddy ni. Stonybrook lo vunnanu ippudu..Ikkada MS chestunna..

I have question in your code, I didn't understand the purpose of this function : Bit2Char.