Wednesday, January 30, 2008

[TECH] A Contradictory gcc message.

I have been experimenting with some of my PDM (parallel disk model) sorting code and wanted to create huge files with billions of keys. When I tried to create a file with 1 billion integers (1024*1024*1024*4 bytes), the program stopped after some time saying "Filesize limit exceeded", even though the 'limit' command in my shell (csh) showed unlimited.

[vamsik@abadon PDMSorting]$ ./pdm_sort 
Setting RAND_SEED to 1201763856 
Filesize limit exceeded
[vamsik@abadon PDMSorting]$ ls -al key_file.txt 
---------- 1 vamsik fuse 2147483647 Jan 31 02:18 key_file.txt
[vamsik@abadon PDMSorting]$ 
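
A note on the listing above: the file stopped at exactly 2147483647 bytes, which is 2^31 - 1, the largest offset a signed 32-bit off_t can hold, while 1024*1024*1024 four-byte keys need 4294967296 bytes. So the most likely explanation is that the program was built without large-file support, not that a shell limit was hit. The empty permission bits are what you would expect if open() was called with O_CREAT but without a mode argument, in which case the mode is undefined. Below is a minimal sketch in C under those assumptions; the helper names key_file_bytes and open_key_file are made up for illustration and are not taken from pdm_sort.

```c
#define _FILE_OFFSET_BITS 64   /* ask for a 64-bit off_t; must precede all includes */
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Largest offset a signed 32-bit off_t can hold: 2^31 - 1 bytes. */
#define MAX_OFFSET_32BIT UINT64_C(2147483647)

/* Bytes needed to store n_keys 4-byte integers. */
static uint64_t key_file_bytes(uint64_t n_keys)
{
    return n_keys * UINT64_C(4);
}

/* Hypothetical helper (not from pdm_sort): create the key file with an
 * explicit mode, so the permission bits are rw-r--r-- before the umask
 * is applied, instead of whatever happens to be on the stack. */
static int open_key_file(const char *path)
{
    return open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
}
```

With these definitions, key_file_bytes(1024 * 1024 * 1024) is 4294967296, which is larger than MAX_OFFSET_32BIT, so a build with a 32-bit off_t cannot write the whole file.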

I saw that the file was created without any permissions, so I tweaked my 'umask', but nothing changed (I was doing an open with (O_WRONLY|O_CREAT)). You might be wondering what exactly I'm trying to say in this post; in fact the story has only just started. Rather than setting my 'umask' to 'umask 22', I set it to 'umask 755', and then did a build with 'make' as usual. This is what happened:

[vamsik@abadon PDMSorting]$ make
gcc  -O2  -I../myutil/            -c ExternalSort.c
ExternalSort.c:1: fatal error: can't open /tmp/cct9kuSr.s for writing: Permission denied
compilation terminated.
The bug is not reproducible, so it is likely a hardware or OS problem.
make: *** [ExternalSort.o] Error 1
[vamsik@abadon PDMSorting]$ 

Looks strange, right? The funny thing is that it says "The bug is not reproducible, so it is likely a hardware or OS problem." It's really a stupid error message: how can it say the bug cannot be reproduced? Just set 'umask 755' and run 'gcc' on any file, and it is going to say the same thing (a umask of 755 strips every owner permission from newly created files, so gcc cannot write its own temporary file in /tmp). In fact the message is an utter contradiction, because we can reproduce this just by changing the 'umask'. I'm going to report this to the 'gcc' maintainers or submit a patch to them.
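
To make the effect of the umask concrete, here is a small C sketch that creates a file under a given umask and reports the permission bits it actually received. It assumes gcc creates its temporary file with a mode like 0600 (as mkstemp() does) and opens it by name again later; that part is my assumption, I have not checked the gcc sources. The function name mode_under_umask and the file name used with it are made up for this example.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Create `path` with the requested mode while `mask` is the process umask,
 * and return the permission bits the file actually received, or -1 on error.
 * The file is removed again and the previous umask is restored. */
static int mode_under_umask(const char *path, mode_t mask, mode_t requested)
{
    struct stat st;
    int result = -1;
    mode_t old = umask(mask);            /* umask() returns the previous mask */
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, requested);

    if (fd >= 0) {
        if (fstat(fd, &st) == 0)
            result = (int)(st.st_mode & 0777);
        close(fd);
        unlink(path);
    }
    umask(old);
    return result;
}
```

On a directory without default ACLs, mode_under_umask("umask_demo.tmp", 0755, 0600) returns 0, i.e. a file that not even its owner may open for writing, while the same call with a umask of 0022 returns 0600.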

Monday, January 14, 2008

[TECH] Ideas for Optimizing Design pattern implementations with Stack Collapsing

One major drawback, I guess, of all these object-oriented systems is performance: since people write the code so that it is extensible, it always ends up creating deep stacks. Recently I was looking into Jmol; this structure visualization tool supports several input format descriptions for the structures (pdb, cif, molxyz, ...). This is how they hide format details from the display code:
JmolAdapter (abstract class)
==> SmarterJmolAdapter (this contains an interface called AtomSetCollection). So, depending on the input file, we have several AtomSetCollection readers like PdbAtomSetCollection etc.

I came to the conclusion that, to implement the structural alignment algorithm in Jmol, I need to implement an AtomSetCollection that structurally aligns the protein structures. But I guess there are several drawbacks to this object-oriented Java implementation: an extensible design does not come for free, it comes at the cost of performance. I found that a simple command execution could end up creating a stack of depth 32, and this is what concerns me. Is there a way to collapse the stack when we know that the intermediate functions on the stack are just delegating the call to the actual instance? I guess this STACK COLLAPSING technique can be applied whenever the Adapter design pattern is used.

For example, see the stack of depth 32 below. It is not doing anything great, it is just opening a new script file. Because of the way the code is written (use of design patterns) the stack depth increases a lot, and ULTIMATELY ALL THESE PATTERNS ARE JUST DELEGATING A FUNCTION CALL TO THE FUNCTION ON TOP OF THE STACK. I'm thinking of ways to improve the delegation, since a lookup is done at each level of the stack and the cost increases linearly with the depth of the stack.

  [1] org.jmol.viewer.ScriptManager.addScript (
  [2] org.jmol.viewer.ScriptManager.addScript (
  [3] org.jmol.viewer.Viewer.evalStringQuiet (,152)
  [4] org.jmol.viewer.Viewer.evalString (,119)
  [5] org.jmol.viewer.Viewer.openFile (,379)
  [6]$OpenAction.actionPerformed (,435)
  [7] javax.swing.AbstractButton.fireActionPerformed (,849)
  [8] javax.swing.AbstractButton$Handler.actionPerformed (,169)
  [9] javax.swing.DefaultButtonModel.fireActionPerformed (
  [10] javax.swing.DefaultButtonModel.setPressed (
  [11] javax.swing.AbstractButton.doClick (
  [12] javax.swing.plaf.basic.BasicMenuItemUI.doClick (,051)
  [13] javax.swing.plaf.basic.BasicMenuItemUI$Handler.mouseReleased (,092)
  [14] java.awt.Component.processMouseEvent (,517)
  [15] javax.swing.JComponent.processMouseEvent (,135)
  [16] java.awt.Component.processEvent (,282)
  [17] java.awt.Container.processEvent (,966)
  [18] java.awt.Component.dispatchEventImpl (,984)
  [19] java.awt.Container.dispatchEventImpl (,024)
  [20] java.awt.Component.dispatchEvent (,819)
  [21] java.awt.LightweightDispatcher.retargetMouseEvent (,212)
  [22] java.awt.LightweightDispatcher.processMouseEvent (,892)
  [23] java.awt.LightweightDispatcher.dispatchEvent (,822)
  [24] java.awt.Container.dispatchEventImpl (,010)
  [25] java.awt.Window.dispatchEventImpl (,791)
  [26] java.awt.Component.dispatchEvent (,819)
  [27] java.awt.EventQueue.dispatchEvent (
  [28] java.awt.EventDispatchThread.pumpOneEventForHierarchy (
  [29] java.awt.EventDispatchThread.pumpEventsForHierarchy (
  [30] java.awt.EventDispatchThread.pumpEvents (
  [31] java.awt.EventDispatchThread.pumpEvents (
  [32] (
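
The collapsing idea can be sketched in a few lines. This is only an illustration of the technique, written in C with function pointers, not Jmol code: the struct layer type and the names call_through, collapse and open_script_file are all made up. Each layer either does the work itself or just forwards to the next one; collapse() walks the chain once so that later calls go straight to the layer that does the work.

```c
#include <stddef.h>

/* A hypothetical adapter layer: it either handles the call itself
 * (handler != NULL) or only forwards to `next`. */
struct layer {
    int (*handler)(int arg, int *depth);
    const struct layer *next;
};

/* Uncollapsed call: one frame per forwarding layer.
 * `depth` counts every function entered on the way. */
static int call_through(const struct layer *l, int arg, int *depth)
{
    (*depth)++;
    if (l->handler != NULL)
        return l->handler(arg, depth);
    return call_through(l->next, arg, depth);
}

/* Collapsing: walk the chain once and return the layer that does the work.
 * Assumes the chain ends in a layer with a non-NULL handler. */
static const struct layer *collapse(const struct layer *l)
{
    while (l->handler == NULL)
        l = l->next;
    return l;
}

/* Stand-in for the real work at the top of the stack (placeholder body). */
static int open_script_file(int arg, int *depth)
{
    (*depth)++;
    return arg * 2;
}
```

With two pure forwarding layers in front of open_script_file, call_through() enters four functions before the result comes back, while calling the handler of the collapsed layer enters only one. The lookup cost is paid once in collapse() instead of on every call.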

Saturday, January 05, 2008

[TECH] Association rules by Data Mining (TM Algorithm) on Cancer Data.

I have data-mined the Cancer data using the TM (Transaction Mapping) and FP-Growth algorithms; this is what I did to mine the association rules.

1. Randomly partitioned the data into two parts: part1 with 148458 records and part2 with 153897 records.
2. Used part1 (148458 records) to find the association rules with support >= 0.4 and confidence >= 0.4; I got 72 rules from this.
3. For each rule from step 2, I computed its support and confidence in part2; they turn out to be close to the support and confidence in the training data (part1).
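
For readers new to the two measures used in the steps above, here is a minimal C sketch using the standard definitions: the support of a rule A => B is the fraction of records that contain every item of A and B, and the confidence is the number of records containing A and B divided by the number containing A. The bitmask encoding of records and the function names are assumptions made for this example; this is not the TM or FP-Growth code, only the check done in step 3.

```c
#include <stddef.h>
#include <stdint.h>

/* Each record is a bitmask of the items it contains (hypothetical encoding).
 * Count the records that contain every item in `items`. */
static size_t count_containing(const uint32_t *records, size_t n, uint32_t items)
{
    size_t hits = 0;
    for (size_t i = 0; i < n; i++)
        if ((records[i] & items) == items)
            hits++;
    return hits;
}

/* Support of the rule lhs => rhs: fraction of all records containing both. */
static double rule_support(const uint32_t *records, size_t n,
                           uint32_t lhs, uint32_t rhs)
{
    if (n == 0)
        return 0.0;
    return (double)count_containing(records, n, lhs | rhs) / (double)n;
}

/* Confidence of lhs => rhs: among the records containing lhs,
 * the fraction that also contain rhs. */
static double rule_confidence(const uint32_t *records, size_t n,
                              uint32_t lhs, uint32_t rhs)
{
    size_t with_lhs = count_containing(records, n, lhs);
    if (with_lhs == 0)
        return 0.0;
    return (double)count_containing(records, n, lhs | rhs) / (double)with_lhs;
}
```

For the four toy records { 0x3, 0x1, 0x7, 0x4 } and the rule 0x1 => 0x2, two of the four records contain both items and three contain the left-hand side, so the support is 0.5 and the confidence is 2/3. Running the same two functions over part1 and part2 gives the pairs of numbers compared in step 3.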

The 72 rules of step 2 [ ]

Support and Confidence of each of these rules in part2
[ ]

I have made the rules human readable by removing all the encoding; please
see the rules
[ ]

These are in the following format:
SUP:0.402 ,CONF:0.412,TRAIN_SUP:0.404,TRAIN_CONF:0.414
Diagnosis of invasive breast cancer within one year of the index
screening mammogram = no,
Diagnosis of invasive or ductal carcinoma in situ breast cancer within
one year of the index screening mammogram = no,
menopaus = postmenopausal or age>=55,
hispanic = no,

SUP indicates the support of this rule in part2, CONF indicates the confidence of this
rule in part2, TRAIN_SUP indicates the support of this rule in part1 and
TRAIN_CONF indicates the confidence of this rule in part1.

These rules may not make any sense to me, but they might make sense to a cancer doctor. There are several useful perl programs for people who want to do some data mining; please feel free to use them, and let me know if you have any questions.