Wednesday, December 23, 2009

[TECH] Manipulating DNA sequences with bits

The alphabet of DNA sequences consists of 4 alphabets {A,T,G,C} and they exists in compliments -- A-T, G-C. Often in sequence assembly algorithms we need to squeeze these into bits for saving space since we have too many of them. For example a typical input to a sequence assembler consists of at least 10 million reads (string of fixed length, say 21). In my last post I briefly described the Sequence Assembly (SA) problem. As I pointed one of the basic operations on these reads would be to obtain a reverse compliment of a given read. For example our read is ATTACCAG , then its reverse compliment would be CTGGTAAT -- reverse the original sequence replace every symbol by its compliment (A-T, G-C). The following routines may be useful when we squeeze these symbols into bits using only 2bits per symbol. While using the bit-wise operators I realized that == operator has more precedence than & (bit-wise or). So if you a&b==b then that will always be true, so we need to be more careful use (a&b)==b.
#define MASK_A 3
#define MASK_T 0
#define MASK_G 1
#define MASK_C 2
/*Lets fix the maximum size of the read as 32 base pairs*/
typedef unsigned long long kmer_t;

/*Takes a read of length K and returns its bit represenation*/
kmer_t CreateBKmer(char *kmer, unsigned long K){
    unsigned long i;
    unsigned char bitc;
    /*A binary k-mer*/
    kmer_t bkmer = (kmer_t) 0; 
    for(i=0; i<K; i++){
        bkmer <<= 2; 
        bkmer |= Char2Bit(kmer[i]);
    return bkmer;

/*Return the reverse compliment of the kmer_a*/
kmer_t ReverseCompliment(kmer_t a, unsigned int K){
    kmer_t rc=0;
    unsigned int i=0;
    for(i=0; i<K; i++){
        rc <<=2;
        rc |= (3-(a&3));
        a >>=2;
    return rc;
/*Print the ASCII equivalent of the underlying bits*/
void PrintKmer(kmer_t a, unsigned int K, FILE *ptr){
    char ntide;
    kmer_t mask_a = MASK_A;
    kmer_t mask_t = MASK_T;
    kmer_t mask_g = MASK_G;
    kmer_t mask_c = MASK_C;

    mask_a <<= 2*(K-1); mask_t <<= 2*(K-1); 
    mask_g <<= 2*(K-1); mask_c <<= 2*(K-1);

        if((a&mask_a) == mask_a){
            ntide = 'A';
        }else if((a&mask_c) == mask_c){
            ntide = 'C';
        }else if((a&mask_g) == mask_g){
            ntide = 'G';
        }else{ /*This should be the last because T=0*/
            ntide = 'T';
        fprintf(ptr, "%c", ntide);
        mask_a >>=2; mask_t >>=2;
        mask_g >>=2; mask_c >>=2;
    }while(mask_a); /*MASK_A = 3 */
    fprintf(ptr, "\n");

inline char Bit2Char(unsigned char a){
        case MASK_A:
            return 'A';
        case MASK_G: 
            return 'G'; 
        case MASK_C:
            return 'C';
        case MASK_T:
            return 'T';

inline unsigned char Char2Bit(char c){
        case 'A':
            return MASK_A;
        case 'T':
            return MASK_T; 
        case 'G':
            return MASK_G;
        case 'C':
            return MASK_C; 

1 comment:

Balanagireddy said...

Hi Vamsi,

Nenu balanagireddy ni. Stonybrook lo vunnanu ippudu..Ikkada MS chestunna..

I have question in your code, I didn't understand the purpose of this function : Bit2Char.

Can you please elaborate.

-- Bala