Sunday, December 19, 2010

Web Link Auto-Updating

Every so often, people come across outdated or dead links on a web page. It seems to me that this situation is mostly avoidable, or at the very least, that we can resolve it to most people's satisfaction. Why not have a server-side script that checks, on a regular basis, the availability of the links in your web pages? If a link is unavailable for a predefined length of time, it is made to point to a recently cached copy, and the webmaster (maybe you) is notified. The script could be customized to go to Google's cache, archive.org, or, if there is space on our server, a local cache.
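As a rough illustration, here is a minimal sketch of the checking side of this idea, written in R to match the code elsewhere on this blog. The seven-check failure threshold, the use of RCurl's url.exists(), and the Wayback Machine fallback URL are all assumptions made for the example; a real deployment would run from cron, persist the failure counts, and actually rewrite the page and email the webmaster instead of printing a message.

library(RCurl)

# TRUE if the URL currently responds; any error is treated as "down".
is.alive <- function(url) {
  tryCatch(url.exists(url), error = function(e) FALSE)
}

# Illustrative fallback: the Wayback Machine's latest-capture URL scheme.
cached.copy <- function(url) paste("http://web.archive.org/web/", url, sep = "")

# One pass over the links. 'failures' counts consecutive failed checks per
# URL; once a link has been down for max.failures checks in a row, report
# the substitution.
check.links <- function(urls,
                        failures = setNames(integer(length(urls)), urls),
                        max.failures = 7) {
  for (u in urls) {
    if (is.alive(u)) {
      failures[u] <- 0L
    } else {
      failures[u] <- failures[u] + 1L
      if (failures[u] >= max.failures)
        cat(sprintf("Replace %s with %s and notify the webmaster\n",
                    u, cached.copy(u)))
    }
  }
  failures
}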

One difficulty with this scheme is deciding when to point the link back at the original. If we do this automatically as soon as the original becomes available again, we run the risk of pointing to a page that is significantly different from the one that was cached. We could settle on a metric of difference and keep the link pointing at the cached content while the metric exceeds a threshold. This would still require human intervention at some point, but it would give webmasters some breathing room.
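For concreteness, one candidate metric, again sketched in R: the Jaccard distance between the sets of words appearing on the cached copy and on the revived page. The crude tag-stripping regex and the 0.5 threshold are arbitrary choices for illustration only.

# Jaccard distance between the word sets of two pages (illustrative only).
word.set <- function(html) {
  txt <- gsub("<[^>]*>", " ", tolower(html))   # crude tag stripping
  setdiff(unique(unlist(strsplit(txt, "[^a-z0-9]+"))), "")
}

jaccard.distance <- function(a, b)
  1 - length(intersect(a, b)) / length(union(a, b))

# Keep pointing at the cache while the revived page differs too much.
keep.cached <- function(cached.html, live.html, threshold = 0.5)
  jaccard.distance(word.set(cached.html), word.set(live.html)) > threshold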

Doing this manually for sites of moderate to large size might be feasible if we observe that the population of link targets does not change significantly or go dark for more than x days per year.

Saturday, September 18, 2010

Optimization in R

I recently performed this computation in R:

stats.irreg <- mclapply(firsts.irreg,
                        function(x) list(means = sapply(cd(x), mean),
                                         meds  = sapply(cd(x), median),
                                         sds   = sapply(cd(x), sd),
                                         cvs   = sapply(cd(x), cv)),
                        mc.cores = 2)

This iterates over firsts.irreg, a list of sublists of vectors, each vector being a time series. For each sublist, a list is returned with the means, medians, standard deviations, and coefficients of variation of the vectors in the sublist. I claim that R is doing much more computation than necessary: the 'means=' entry computes the mean of each vector and the 'sds=' entry computes the standard deviation of each vector, while the 'cvs=' entry computes the coefficient of variation of each vector, which is the standard deviation divided by the mean. Thus R will compute the mean and standard deviation of each vector twice. Is it possible, or even desirable, to have a language/interpreter/compiler recognize this and optimize it, or should it always be left to the programmer?
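For what it is worth, the duplicate work is easy to remove by hand. The version below (same cd() helper, and taking cv(x) to be sd(x)/mean(x) as described above) evaluates cd(x) once per sublist and reuses the means and standard deviations for the coefficients of variation:

stats.irreg <- mclapply(firsts.irreg, function(x) {
  vecs  <- cd(x)
  means <- sapply(vecs, mean)
  sds   <- sapply(vecs, sd)
  list(means = means,
       meds  = sapply(vecs, median),
       sds   = sds,
       cvs   = sds / means)   # reuse instead of recomputing
}, mc.cores = 2)

An optimizer would have to prove that cd, mean, and sd are pure and are not redefined between calls before it could safely merge the repeated calls, which is a tall order in a language as dynamic as R.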

Wednesday, April 7, 2010

Auto EPS Conversion in pdflatex

I recently decided to start using PDF figures in my thesis instead of EPS and compiling my TeX source directly to PDF. This requires that the perl script 'epstopdf' be installed on your system, along with ghostscript and the associated machinery. The TeX package "epstopdf", combined with \usepackage[pdftex]{graphicx}, will automatically convert any EPS figures used in your document to PDF. The actual conversion works wonderfully. There is a snag, however, as described by the author of the package: even with TEXINPUTS set correctly, the 'epstopdf' script cannot find figures located in nested subdirectories.
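For reference, a minimal preamble along these lines; the figure name and the \graphicspath entry are placeholders, and shell escape must be permitted (pass -shell-escape to pdflatex if your distribution does not already allow epstopdf under restricted shell escape):

% Minimal example: an EPS figure under figures/ is converted to PDF on the fly.
\documentclass{report}
\usepackage[pdftex]{graphicx}
\usepackage{epstopdf}
\graphicspath{{figures/}}
\begin{document}
\includegraphics[width=0.8\textwidth]{myplot}  % finds and converts myplot.eps
\end{document}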

The solution I settled on was to make a modified copy of the script and put it in my local path ahead of the real epstopdf (which usually lives in /usr/bin). The trick is to use kpsewhich to search all the associated TeX directories, including those specified recursively in TEXINPUTS. The fix is as follows:

Make a copy of 'epstopdf', and find the following lines in your new copy:

@ARGV > 0 or die errorUsage "Input filename missing";
@ARGV < 2 or die errorUsage "Unknown option or too many input files";
$InputFilename = $ARGV[0];

and change the last line to:

chomp($InputFilename = `kpsewhich $ARGV[0]`);

This sets $InputFilename to the full path that kpsewhich finds for the figure (chomp strips the trailing newline from the command's output).

So far this works like a charm on all my Linux distros (RHEL and Ubuntu). YMMV!