#define MASK_A 3 #define MASK_T 0 #define MASK_G 1 #define MASK_C 2 /*Lets fix the maximum size of the read as 32 base pairs*/ typedef unsigned long long kmer_t; /*Takes a read of length K and returns its bit represenation*/ kmer_t CreateBKmer(char *kmer, unsigned long K){ unsigned long i; unsigned char bitc; /*A binary k-mer*/ kmer_t bkmer = (kmer_t) 0; for(i=0; i<K; i++){ bkmer <<= 2; bkmer |= Char2Bit(kmer[i]); } return bkmer; } /*Return the reverse compliment of the kmer_a*/ kmer_t ReverseCompliment(kmer_t a, unsigned int K){ kmer_t rc=0; unsigned int i=0; for(i=0; i<K; i++){ rc <<=2; rc |= (3-(a&3)); a >>=2; } return rc; } /*Print the ASCII equivalent of the underlying bits*/ void PrintKmer(kmer_t a, unsigned int K, FILE *ptr){ char ntide; kmer_t mask_a = MASK_A; kmer_t mask_t = MASK_T; kmer_t mask_g = MASK_G; kmer_t mask_c = MASK_C; mask_a <<= 2*(K-1); mask_t <<= 2*(K-1); mask_g <<= 2*(K-1); mask_c <<= 2*(K-1); do{ if((a&mask_a) == mask_a){ ntide = 'A'; }else if((a&mask_c) == mask_c){ ntide = 'C'; }else if((a&mask_g) == mask_g){ ntide = 'G'; }else{ /*This should be the last because T=0*/ ntide = 'T'; } fprintf(ptr, "%c", ntide); mask_a >>=2; mask_t >>=2; mask_g >>=2; mask_c >>=2; }while(mask_a); /*MASK_A = 3 */ fprintf(ptr, "\n"); } inline char Bit2Char(unsigned char a){ switch(a){ case MASK_A: return 'A'; case MASK_G: return 'G'; case MASK_C: return 'C'; case MASK_T: return 'T'; } } inline unsigned char Char2Bit(char c){ switch(c){ case 'A': return MASK_A; case 'T': return MASK_T; case 'G': return MASK_G; case 'C': return MASK_C; } }
Algorithms, Theory, Spirituality, Life, Technology, Food and Workout : trying to sort these deterministically in $\Theta(1)$ time (constant time).
Wednesday, December 23, 2009
[TECH] Manipulating DNA sequences with bits
The alphabet of DNA sequences consists of 4 alphabets {A,T,G,C} and they exists in compliments -- A-T, G-C. Often in sequence assembly algorithms we need to squeeze these into bits for saving space since we have too many of them. For example a typical input to a sequence assembler consists of at least 10 million reads (string of fixed length, say 21). In my last post I briefly described the Sequence Assembly (SA) problem. As I pointed one of the basic operations on these reads would be to obtain a reverse compliment of a given read. For example our read is ATTACCAG , then its reverse compliment would be CTGGTAAT -- reverse the original sequence replace every symbol by its compliment (A-T, G-C). The following routines may be useful when we squeeze these symbols into bits using only 2bits per symbol. While using the bit-wise operators I realized that == operator has more
precedence than & (bit-wise or). So if you a&b==b then that will always be true, so we need to be more careful use (a&b)==b.
Subscribe to:
Post Comments (Atom)
1 comment:
Hi Vamsi,
Nenu balanagireddy ni. Stonybrook lo vunnanu ippudu..Ikkada MS chestunna..
I have question in your code, I didn't understand the purpose of this function : Bit2Char.
Can you please elaborate.
-- Bala
Post a Comment