## Saturday, May 14, 2011

### [TECH] Enumerating set partitions with $DLX$

I have been a great fan of Dr. Knuth's paper about Dancing Links (Algorithm X) . In that paper Don suggests how the combinatorial search using backtracking can be improved in practice with the original idea from Hitotumatu and Noshita. One strange thing about Dancing Links is that its not Don's original idea, however Don made it popular by writing a paper about it. Don used the $DLX$ idea to solve the exact cover (matrix cover) problem and he gave several examples (e.g. Covering an rectilinear area with Pentominos) where he certain combinatorial problems to exact cover.

I have been waiting to find a problem where I get to apply $DLX$ without reducing it to exact cover problem. I'm really happy now that simultaneous hamming neighborhood problem (given a set $S=\{s_1,s_2\ldots s_n\},\, s_i\in \Sigma^{l}$ of strings compute the set $N_d(S) = \{ t | hamming(s_i,t)\leq d ,\forall s_i \in S\}$) which I have been working lately can be solved by applying $DLX$. As I mentioned in my earlier post the algorithm to compute $N_d(S)$ needs solve a set partition problem -- which further needs an algorithm to enumerate all the subsets of a given size $r$. In my previous post we have seen how to compute $n \choose r$ efficiently with backtracking. I'm now going use $DLX$ to solve the same problem. Notice that $DLX$ is an overkill to this simple problem, however it really makes sense if we look it from the perspective of the partitioning problem -- which is what we want to solve. $DLX$ just plays the role of UNDO operation see below.

/*Implementation of $n \choose r$ using $DLX$*/
enum_count_t EnumSubSetsCountDLX(size_t n, size_t r){
enum_count_t n_choose_r=0;
DLXState *dlx_sentinal = GetEnumSubSetsDLXState(n); /*Initialize the DLXState*/
DLXState *sidx_dat[r+1], **state_idx=sidx_dat; /*item picked in the state*/
/*total # of items picked in a given state*/
size_t scidx_dat[r+1], *state_count_idx=scidx_dat;
size_t level;

state_count_idx--; state_idx--; /*1-indexing*/
level=1; state_count_idx[level]=0; state_idx[level] = dlx_sentinal;
while(level){
if(level == r+1){
n_choose_r++;
level--;
if(level) Dance(state_idx[level]);
}else if(state_count_idx[level]+r-level > n-1){ /*state_idx[level]*/
level--;
if(level) Dance(state_idx[level]);
}else{
state_idx[level] = state_idx[level]->right;
Break(state_idx[level]);
state_count_idx[level]++;
/*track forward*/
state_count_idx[level+1] = state_count_idx[level];
state_idx[level+1] = state_idx[level];
level++;
}
}
CleanDLXState(dlx_sentinal); /*cleanup*/
return n_choose_r;
}

/*Break and Dance operations*/
void Break(DLXState *state){
state->left->right = state->right;
state->right->left = state->left;
}
void Dance(DLXState *state){
state->left->right = state;
state->right->left = state;
}

/*Initializes a new DLX state required to enumerate $n \choose r$*/
EnumSubSetState* GetNewEnumSStateWithDLX(DLXState *dlx_sentinal,
size_t n, size_t r){

EnumSubSetState *estate = malloc(sizeof(EnumSubSetState));
assert(estate);
estate->dlx_sentinal = dlx_sentinal;

estate->state_count_idx = malloc(sizeof(size_t)*(r+1));
estate->state_idx = malloc(sizeof(DLXState *)*(r+1));
assert(estate->state_count_idx && estate->state_idx);

(estate->state_count_idx)--; (estate->state_idx)--;

estate->level=1; estate->state_count_idx[1]=0;
estate->state_idx[1] = estate->dlx_sentinal;
estate->n = n; estate->r = r;
return estate;

}



Next I'll post the most interesting part which is to enumerate the partitions using $DLX$.

## Friday, May 06, 2011

### [TECH] Enumerating subsets with backtracking and some interesting observations on the backtracking enumeration tree.

During my current research on enumerating all the common Hamming neighborhood of a set of strings. I needed an algorithm which can enumerate seven disjoint subsets from a given set. The fundamental building block in that is an algorithm which enumerates all the subsets of size $r$ from a given set of size $n$. I'm very much aware of the standard recursive algorithm which uses the fact ${n \choose r} = {n-1 \choose r} + {n-1 \choose r-1}$. However I wanted a much efficient version because I need to use this several times and cannot afford the constants in stack based implementations. So I wrote the following to backtracking based algorithm to enumerate the subsets. Once I did that I started observing some very interesting patterns in the number of leaves for each internal backtrack node at level $r-1$.

/*Back tracking*/
enum_count_t EnumSubSetsPrint(char *set, size_t n, size_t r){
size_t state_idx_data[r+1], i, level;
size_t *state_idx = state_idx_data;
enum_count_t n_choose_r=0;

set--;
state_idx--;
for(i=1; i<=r+1; i++){ /*initialize*/
state_idx[i] = 0;
}

level = 1;
while(level){
if(level == r+1){
/*back track*/
for(i=1; i<=r; i++){
printf("%c ", set[state_idx[i]]);
}
printf("\n");
n_choose_r++;
level--;
}else if(state_idx[level]+r-level > n-1){
level--;
}else{
/*track forward*/
state_idx[level]++; /*pick a item*/
state_idx[level+1] = state_idx[level];
level++;
}
}
return n_choose_r;
}


By looking at these patterns (at the end of this paragraph) I came up with the following interesting identities for $n \choose 3$ and $n \choose 4$.
IDENTITY-1: ${n\choose 3} = \sum_{i=1}^{n-2} (n-i)(i-1)$
IDENTITY-2: ${n\choose 4} = \sum_{j=1}^{n-3}\sum_{i=1}^{n-2-j} (i)(j)$
I'm actually printing the number of leaves under an internal node at level $r-1$ in the backtracking tree. I'm observing the following pattern.

=================
8 3 0

6 5 4 3 2 1
5 4 3 2 1
4 3 2 1
3 2 1
2 1
1
(8 \choose 3) = 56

Please enter n and r < n and print_option
(e.g. 15 14 1 or 15 14 0)
9 3 0

7 6 5 4 3 2 1
6 5 4 3 2 1
5 4 3 2 1
4 3 2 1
3 2 1
2 1
1
(9 \choose 3) = 84

Please enter n and r < n and print_option
(e.g. 15 14 1 or 15 14 0)
10 3 0

8 7 6 5 4 3 2 1
7 6 5 4 3 2 1
6 5 4 3 2 1
5 4 3 2 1
4 3 2 1
3 2 1
2 1
1
(10 \choose 3) = 120
Please enter n and r < n and print_option
(e.g. 15 14 1 or 15 14 0)
11 3 0

9 8 7 6 5 4 3 2 1
8 7 6 5 4 3 2 1
7 6 5 4 3 2 1
6 5 4 3 2 1
5 4 3 2 1
4 3 2 1
3 2 1
2 1
1
===============================

The following are for $n \choose 4$ for various values of $n$.
===============================
8 4 0

5 4 3 2 1
4 3 2 1
3 2 1
2 1
1
4 3 2 1
3 2 1
2 1
1
3 2 1
2 1
1
2 1
1
1
(8 \choose 4) = 70

Please enter n and r < n and print_option
(e.g. 15 14 1 or 15 14 0)
9 4 0

6 5 4 3 2 1
5 4 3 2 1
4 3 2 1
3 2 1
2 1
1
5 4 3 2 1
4 3 2 1
3 2 1
2 1
1
4 3 2 1
3 2 1
2 1
1
3 2 1
2 1
1
2 1
1
1

===============================


Play around with this program Enumerate.c , Enumerate.h.

Finally I enabled $LaTeX$ setting for the blog.

## Thursday, May 05, 2011

### [THEORY] Is nature a giant guessing algorithm ?

I was reading Dr. Lipton's blog on guessing and was really impressed on how he relates the power of guessing with the question whether P=NP or P!=NP.

Looks like nature on the other had seems to be solving several combinatorial problems almost instantly (e.g. folding of the proteins to minimize its energy), does nature guess the solution to the problem ?. One way to guess a solution to a problem is to look inside the building blocks which formulated the problem. Dr. Lipton gives an example were we were supposed to guess the digits which can open a lock -- the lock we normally use in the GYM. It may be hard to guess the digits which can open a lock without any help. However if we can get an X-ray of the levers we can guess the solution to the problem more easily. So does this mean nature looks at the problems in a totally different perspective ? , is there something very intuitive to the nature which the humans fail to recognize ?.

Why did all the great mathematicians failed to find a polynomial solution 3-SAT ? -- or prove it does not exist. Is there something missing in the way we are looking at a digital circuit ? , can we design guessing algorithms to solve the NP-Complete problems ?. We don't know the answers for all these. But I'm sure that nature is a great guessing algorithm and it may be impossible for us to guess its guessing algorithm.

## Tuesday, April 12, 2011

### [TECH] Unbuffered message logging of interactive programs.

We generally use one of the following methods to log the output messages of programs which print a lot of messages.

•  $./program >& message.log  • $ ./program |& tee message.log
Since stdout is line buffered we expect the same to hold when we re-direct stdout or any other stream. However this is not true if you keep monitoring the program status with tail -f message.log you will not notice any output from the program this is because the redirection is not line buffered. And in certain unfortunate cases when the running program aborts the redirection may not even flush the buffers and you find that message.log is empty and there is NO way to figure out what has happened.

One obvious way to fix this is to put fflush(stdout) after every printf statement you have, this is OK for small programs but for very large programs its really tedious. I came up with the following technique to solve this problem. This solution is generic.

[main.c]

int main(int argc, char **argv){

ForceLogMessages("message.log");
....
...
}

void ForceLogMessages(const char *mesg_file){
int ret_setvbuf;
/*close the stdout and re-open a text file*/
fclose(stdout);
stdout = fopen(name, "w");
//make stdout unbuffered
ret_setvbuf = setvbuf(stdout, (char *) NULL, _IOLBF, 0);
if(ret_setvbuf){
fprintf(stderr, "UNABLE TO MAKE stdout UNBUFFERED %s", strerror(errno));
}
assert(stdout);
}


Now if we run the program and we can monitor the status with tail -f message.log since we have made stdout unbuffered we can see a message in message.log every time a printf is executed in the original program and you can know what exactly is happening.

## Thursday, March 24, 2011

### [TECH] juggernaut-asm: An out-of-core sequence assembler

Development updates on the juggernaut-asm project.I'm now branching off the trunk to implement the feature which avoids using the UNIX sort. To checkout the main branch as follows. This is the version of the code I used to obtain the results on 4million reads which the in-memory algorithm of Velvet could not handle.

cvs -d:pserver:anonymous@juggernaut-asm.cvs.sourceforge.net:/cvsroot/juggernaut-asm co .


The code to replace the unix sort being developed in the branch name 'replace-unix-sort'.

cvs -d:ext:vamsi99@juggernaut-asm.cvs.sourceforge.net:/cvsroot/juggernaut-asm co -r replace-unix-sort .


## Wednesday, March 23, 2011

### [TECH] External R-Way Merge sorting.

http://lib-ex-sort.sourceforge.net/ is an external sorting program the following are the limitations I wish to remove before I leave in May. Most of them are engineering changes but are very useful for several algorithms based on external sorting.

• Currently the size of each key is a constant. We need to remove this by adding a key_header for each key.
• Use MMAP for integer sorting to avoid copy between user and kernel space.
• Support for sorting when the data spans multiple files.