## The Drug called Google Scholar

October 17, 2011

I was just reading a little bit about dose-response functions in medicine and how they may have different shapes. Somehow I had assumed that most treatments would have monotonic effects, but in fact they may not. Consequently, there is often a need to perform statistical tests to see whether there is a monotonic trend.

Continuing on with my scientometrics meme (this is getting worse than the birds, eh?), I went ahead and collected data for all years from 2004 to 2011 (the 2011 set does not yet contain all theses). I had previously demonstrated that there were indeed more references in 2010 than in 2004, but was that a coincidence, or is there an actual monotonic trend? If you plot out the mean and median, it does seem like there might be a noisy upward trend, but a formal test would be nice. Of course, I am thinking of “technological progress” as the dose and the number of references as the response.

Looking around the internet for an appropriate statistical test for monotonicity, I found that this area of order-restricted inference is actually not at all well settled, particularly for unbalanced designs and non-parametric settings such as this one. Often this is due to computational difficulties.

Notwithstanding, I decided to follow the regression-style method of Tukey et al. that “combines all the allowed principles of witchcraft.” As Tukey et al. argue, using a unified regression is better than the pairwise KS-tests (or their equivalents) that one might otherwise have considered.

In contrast to their setting where doses have actual measures, e.g. in milligrams, in my setting it is very unclear what the “dose of technological progress” is. Hence, rather than considering arithmetic, ordinal, and arithmetic-logarithmic candidate dose scalings and using the one with minimal p-value, I restricted myself to only ordinal scaling. As Capizzi et al. say, “the use of regression on a single scaling may generate controversy and doubt about one’s motives, especially in a regulatory environment,” but oh well.

Note that unlike traditional uses of regression for parameter estimation, here the goal is detection: to detect whether or not there is a monotonic trend.

So we have a sequence of ordinal doses, sample sizes, and mean responses as follows.

| Year | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 |
|---|---|---|---|---|---|---|---|---|
| Ordinal Dose | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| Sample Size | 62 | 68 | 109 | 95 | 108 | 100 | 92 | 47 |
| Mean Response | 98.87 | 104.01 | 98.11 | 102.21 | 101.57 | 107.23 | 111.83 | 108.15 |

With this data, I went ahead and used SPSS to perform linear regression, obtaining a positive slope of 1.75 and a p-value of 0.09 for the test against the null hypothesis of zero slope. Hence, there is evidence in favor of a positive trend (at the 90% confidence level). If one were using Matlab, the functions regstats and linhyptest would be useful.
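The slope estimate can in fact be reproduced from the group-level summaries in the table alone: weighted least squares on the yearly means, with weights equal to the sample sizes, yields the same slope as an ordinary regression on the underlying per-thesis counts. (The p-value cannot be recovered this way, since it also depends on the within-year variances, which the table does not carry.) A minimal numpy sketch:

```python
import numpy as np

# Group-level data from the table above
dose = np.arange(8)                                   # ordinal doses 0..7 (years 2004-2011)
n    = np.array([62, 68, 109, 95, 108, 100, 92, 47])  # theses per year
mean = np.array([98.87, 104.01, 98.11, 102.21,
                 101.57, 107.23, 111.83, 108.15])     # mean references per thesis

# Weighted least squares on the group means, weights = group sizes;
# for the slope, this matches an unweighted fit on the raw per-thesis data.
xbar = np.average(dose, weights=n)
ybar = np.average(mean, weights=n)
slope = np.sum(n * (dose - xbar) * (mean - ybar)) / np.sum(n * (dose - xbar) ** 2)
print(round(slope, 2))  # prints 1.75, matching the SPSS fit
```
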

Although I had learned a little statistics when doing some connectomics work in the past (e.g. not to make this mistake), I am certainly learning much more these days. For example, during a three-week Smarter Cities Challenge project last month, I spent a good chunk of time with a statistician who was big on what the data shows, and picked up some tips and tricks.

Incidentally, you might have some interest in these two new neuroscience papers, from Allerton and NIPS, though maybe you have already found them by browsing, searching, or being alerted.
