Saturday, September 18, 2010

Optimization in R

I recently performed this computation in R:

stats.irreg <- mclapply(firsts.irreg, function(x) list(
    means = sapply(cd(x), mean),
    meds  = sapply(cd(x), median),
    sds   = sapply(cd(x), sd),
    cvs   = sapply(cd(x), cv)
), mc.cores = 2)

This iterates over firsts.irreg, a list of sublists of vectors, where each vector is a time series. For each sublist it returns a list containing the means, medians, standard deviations, and coefficients of variation of the vectors in that sublist. I claim that R is doing much more computation than necessary. The 'means=' expression computes the mean of each vector, and the 'sds=' expression computes the standard deviation of each vector. The 'cvs=' expression computes the coefficient of variation of each vector, which is the standard deviation divided by the mean. Thus, R computes the mean and standard deviation of each vector twice. Is it possible, or even desirable, for a language, interpreter, or compiler to recognize this redundancy and optimize it away, or should it always be left to the programmer?
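For comparison, here is a rough sketch of what the hand-optimized version might look like if the programmer removes the redundancy. It is not the original computation: the data in firsts is a made-up stand-in, summarize_sublist is a hypothetical helper name, the cd() step is left out because its definition is not shown above, and cv is assumed to be exactly sd divided by mean. Plain lapply is used so the sketch runs anywhere; mclapply could presumably be swapped back in.

```r
# Hypothetical stand-in for the post's data: a list of sublists of numeric vectors.
firsts <- list(
  a = list(c(1, 2, 3, 4), c(2, 4, 6, 8)),
  b = list(c(10, 10, 20, 20))
)

# Summarize one sublist, computing each mean and standard deviation exactly once.
summarize_sublist <- function(vecs) {
  means <- sapply(vecs, mean)
  sds   <- sapply(vecs, sd)
  list(
    means = means,
    meds  = sapply(vecs, median),
    sds   = sds,
    cvs   = sds / means  # assumed definition of cv: reuse the values above
  )
}

stats <- lapply(firsts, summarize_sublist)
```

The saving comes from binding means and sds to local names before building the result, so the coefficient of variation is a single vectorized division rather than a second pass over every vector.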