Tuesday, April 29, 2008

[TECH] A simple UNIX based algorithm to run a set of tasks in parallel.

Given a set of tasks T={t1,t2....,tn} such that these tasks can be run independently in parallel, however we have a limited set of resources and our constraint is K which tells us the maximum number of process's which can run in parallel. Our problem now is to realize this in UNIX using standard system calls like fork and wait.

I came across this recently and I came across the same thing several times before. Some examples of the contexts which would need this are as follows.

  • Running test cases (regressions) in parallel, suppose we have several thousands of test cases for a product and we wish to run on a machine which can run K tests at any given instance.
  • Making builds in parallel
  • Parallel crawling/web indexing (which was my motivation to write this)

Download this code

$#ARGV == 1 or die("USAGE: perl parallel_runner.pl {file_with_commands} {no.of.process's}");
$command_file = $ARGV[0];
open(CMD_FILE,$command_file) or die("$!");
$max_process = $ARGV[1];
($max_process < 256) or die("Too many process's....use less number");
%child_hash;
$process_count = 0;
while(<CMD_FILE>){
 $cmd = $_;
 if($cmd =~ /^\#/){
  next;
 }
 if($process_count < $max_process){
  $pid = fork();
  if($pid < 0 ){
   die("$! : RESOURCE ERRORS");
  }elsif($pid == 0){
   system("$cmd");
   exit(0);
  }else{
   $child_hash{$pid} = 1;
   $process_count++;
  }
 }

 if($process_count==$max_process){
  $pid = wait();
  if($child_hash{$pid} == 1){
   $child_hash{$pid} = 0;
  }elsif($pid == -1){
   print "No childs to wait?? Confused...quitting\n";
   last;
  } 
  $process_count--;
 }
}
print "Done creating process's...now waiting for childs...\n";
$i = 0;
while($i < $max_process){
 $pid = wait();
 if($pid == -1){
  last;
 }
 $i++;
}

Wednesday, April 23, 2008

[TECH] Computing all the derivatives of a 'n' degree polynomial at point 'x=a' in O(nlog(n)) time.

I came across this problem recently, basically you are given a polynomial f(x) = anxn+an-1xn-1+....a0 and you need to evaluate all its derivatives at a given point say x=a. I guess one of the suggested algorithms for this problem runs in O(nlog2(n)). But I found an idea to solve this in just O(nlog(n)). The following is the brief description of my algorithm

The idea is that we can post this as a simple multiplication of two n degree polynomials, the coefficients resulting from the multiplication will be the evaluated derivatives at a given point x=k. And as we might know that we can use FFT to multiply two n-degree polynomials in O(nlog(n))

  • STEP1: Compute the values of k2,k3....,kn in O(n) time. Also compute the values of n!,(n-1)!,....2! incrementally. The runtime of this step is O(n)
  • STEP2: Create polynomial p(x) = (n!an)xn-1+((n-1)!an-1)xn-2+.....a1.
  • STEP 3: Create polynomial g(x) = (kn-1/(n-1)!)x+(kn-2/(n-2)!)x2+(kn-3/(n-3)!)x3+.... +kxn-1 .
  • STEP 4: Multiply p(x) and g(x) using FFT in O(nlog(n) time.
  • STEP 5: Clearly coefficient of xn-i+1 gives the value of ith derivative.

Saturday, April 19, 2008

[TECH] What does computing the edit distance between a string and its reverse tell us?

I came across a problem recently which asks for the following "Given a string find out the minimum number of characters to be inserted to make it a palindrome, and also give the string". There might be several ways to solve this problem I have a simple algorithm to solve this problem in O(n2). The observation is the following reccurence.

Let LCSr[x,y] = longest common sub sequence between S1...x and Sr1...y , where Sr is the reverse of the string S.
Let X=x1x2....xk be the be the palindrome with minimum # of inserted characters to make it a palindrome, then X has to be formed from S as follows.
1. Find a index k in S such that the cost of transforming S[1..k] to S[k+1...n] is minimum and the palindrome would be X = S'[1...k]S'r[k+1...n] or X = S'[1..k-1]S[k]S'r[k+1..n]
2. The index k such that LCSr[k][len-k] is maximum.
3. Get the code here

/*Builds the LCS between the forward and backwards and finds
a index 'p' which will give minimum # of operations to transform
forward string to backward string.*/
void FindLCSReverse(char *str,unsigned int len){
    int i,j;
    int current_min,max_lcs;
    int imax,jmax,palindex,palindex_r;
    unsigned int operations;
    char path=0;
    for(i=0;i<len;i++){
        LCS[i][0] = 0;
    }
    for(j=0;j<len;j++){
        LCS[0][j] = 0;
    }
    for(i=1;i<len;i++){
        for(j=1;j<len;j++){
            /*use len-j to access other string*/
            if(buf[i] == buf[len-j]){
                LCS[i][j] = LCS[i-1][j-1]+1;
            }else{
                LCS[i][j] = (LCS[i-1][j]>LCS[i][j-1])?LCS[i-1][j]:LCS[i][j-1];
            }
        }
    }
    /*Compute the p which gives minimum inserts to transform 
     *forward string to backward
     */
    max_lcs=0;
    imax=len-1;jmax=0;
    for(i=len-1;i>=1;i--){
        if(LCS[i][len-i-1] > max_lcs){
            max_lcs = LCS[i][len-i-1];
            imax = i;
            jmax = len-i-1;
        }
        if(i>0 && LCS[i-1][len-i-1] >= max_lcs){
            max_lcs = LCS[i-1][len-i-1];
            imax = i-1;
            jmax = len-i-1;
        }
    }
    /*Now find the actual string*/
    i=imax;
    j=jmax; 

    palindex_r=0;
    while(i!=0 || j!=0){
        if(i>0 && j>0){
            if(buf[i] == buf[len-j]){
                palindrome_rev[palindex_r++] = buf[i];
                i--; j--;
                continue;
            }
            current_min = (LCS[i-1][j] > LCS[i][j-1])?LCS[i-1][j]:LCS[i][j-1];
            if(current_min == LCS[i-1][j]){
                if(current_min == LCS[i][j-1] && buf[i] < buf[len-j]){
                    palindrome_rev[palindex_r++] = buf[len-j];
                    j--;
                }else{
                    palindrome_rev[palindex_r++] = buf[i];
                    i--;
                }
            }else if(current_min == LCS[i][j-1]){
                if(current_min == LCS[i-1][j] && buf[len-j] < buf[i]){
                    palindrome_rev[palindex_r++] = buf[i];
                    i--;
                }else{
                    palindrome_rev[palindex_r++] = buf[len-j];
                    j--;
                }
            }
        }else if(j==0){
            palindrome_rev[palindex_r++] = buf[i];
            i--;
        }else if(i==0){
            palindrome_rev[palindex_r++] = buf[len-j];
            j--;
        }
        /*path=1 (left) path=2 (down) path=3 (diag)*/
    }

    for(i=0;i<palindex_r;i++){
        palindrome[palindex_r-1-i] = palindrome_rev[i];
    }
    palindrome_rev[palindex_r] = '\0';
    palindrome[palindex_r] = '\0';

    /*printf("imax = %u jmax = %u len=%u \n",imax,jmax,len);*/
    if(imax+jmax==len-1){
        printf("%s%s\n",palindrome,palindrome_rev);
    }else{
        printf("%s%c%s\n",palindrome,buf[imax+1],palindrome_rev);
    }
}

Tuesday, April 15, 2008

[TECH] Reverse Engineering and Creating Crawler BOTS.

I have my share of pleasure reverse engineering the underlying details of SCOPUS. SCOPUS as you might know is the most popular scholarly database used for citation searching. I was trying to solve this problem "Given a research paper X produce a set of research papers in the same connected component of the CITATION GRAPH", Let me define a CITATION GRAPH (CITE(V,E)) , V = {set of all research papers} and E={(i,j)} set of all directed edges from research paper 'i' to 'j' such that paper 'j' refers paper 'i' in its references. This edge information comes from SCOPUS however SCOPUS gives only one level (depth 1) in the connected component of all related papers, my goal is to get all the related papers (related in the sense fall in the same connected component of the CITE graph).

The reverse engineering the underlying comes handy when we want to automate the process of searching all these from the browser ourself.

I'm too tired to explain the details of the program which I had written using perl+LWP to create a CRAWLER BOT which gets all the related papers but if you need similar stuff sure the code can help click here

Unfortunately I don't get enough time to write blogs but in past few weeks I had some very interesting technical stuff I want to write.