Just to build on my previous post, I thought I’d go ahead and actually run the scientometrics experiment I had described. I already had data on the number of references in MIT EECS doctoral theses from 2010, and I collected the same data for 2004.
My first step was to exclude theses with “chapter-style” references, i.e. a separate reference list at the end of each chapter, rather than a single “monograph-style” list at the end of the thesis. Luckily (at least from my viewpoint), this “chapter-style” trend is not at all popular in EECS: I excluded 4/66 theses from the 2004 data and 6/98 from the 2010 data. Other fields are a different story; as Nils T. Hagen notes in the paper “Deconstructing doctoral dissertations: how many papers does it take to make a PhD?”:
The traditional single-authored monograph-style doctoral dissertation has declined into relative obscurity in the natural and biomedical sciences (cf. Lariviere et al. 2008), and been largely superseded by a modern PhD thesis consisting of a collection of multi-authored manuscripts and publications (e.g. Powell 2004).
After expurgating the data, I wanted to see whether there was a big difference between the distributions. In terms of first-order moments, the average number of references was 99 in 2004 and 112 in 2010. This certainly suggests an increase in the number of references, as I had conjectured. For a full distributional characterization, I plotted the empirical cumulative distribution functions, as seen below.
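For concreteness, here is a minimal MATLAB sketch of this step, assuming the expurgated counts sit in vectors d_2004 and d_2010 (the same names that appear in the kstest2 call below):

% Means of the two samples (99 and 112 for my data).
mean_2004 = mean(d_2004)
mean_2010 = mean(d_2010)

% Empirical CDFs of the two samples, plotted on the same axes.
[f1,x1] = ecdf(d_2004);
[f2,x2] = ecdf(d_2010);
stairs(x1,f1); hold on; stairs(x2,f2);
legend('2004','2010','Location','southeast');
xlabel('number of references'); ylabel('empirical CDF');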
Again, it looks like there are generally more references in 2010 than in 2004, but is there a way to quantify this? In their paper, “Critical Values for the One-Sided Two-Sample Kolmogorov-Smirnov Statistic,” Mitchell H. Gail and Sylvan B. Green say:
The Kolmogorov-Smirnov one-sided two-sample statistic is used to test the null hypothesis F = G against the alternate F > G where F and G are distribution functions. If the random variables X and Y correspond to F and G, respectively, then the one-sided alternative is that Y is stochastically greater than X. For example, one is often interested in the one-sided alternative that survival times (Y) with a new medical treatment are longer than survival times (X) with conventional therapy. The two-sided alternative is of less interest in this case.
This seems perfect, so I ran the one-sided two-sample Kolmogorov-Smirnov test on the expurgated data, using MATLAB. In particular, with the command:
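% one-sided test at alpha = 0.05; 'smaller' is the alternative that the
% CDF of the first sample (d_2010) lies below the CDF of the second
% (d_2004), i.e. that the 2010 counts are stochastically larger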
[h,p] = kstest2(d_2010,d_2004,0.05,'smaller')
I got the result h = 1 and p = 0.0380. In other words, the test rejects the null hypothesis at the 0.05 level: there is a statistically significant difference between the two distributions, in the direction of theses from 2010 having (stochastically) more references than theses from 2004.
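As a sanity check, the one-sided statistic itself is easy to compute by hand. Here is a minimal sketch of that computation (my own illustration, reusing the d_2004 and d_2010 vectors):

% The empirical CDFs only change at observed values, so evaluate there.
x = unique([d_2004(:); d_2010(:)]);
F_2004 = arrayfun(@(t) mean(d_2004 <= t), x);
F_2010 = arrayfun(@(t) mean(d_2010 <= t), x);
% One-sided statistic: how far the 2004 ECDF rises above the 2010 ECDF.
D = max(F_2004 - F_2010)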
You might be worried that the Kolmogorov-Smirnov test is designed for continuous-valued data rather than discrete-valued data like the reference counts here, but don’t be alarmed. As noted by W. J. Conover in the paper “A Kolmogorov Goodness-of-Fit Test for Discontinuous Distributions”:
Studies of the Kolmogorov test with discontinuous distributions appear to be quite limited. The Kolmogorov test is known to be conservative if F(x) is discrete.
and by G. E. Noether in a paper “Note on the Kolmogorov Statistic in the Discrete Case“:
A simple demonstration of the conservative character of the Kolmogorov test in the case of discrete distributions is given.
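To see this conservativeness concretely, here is a quick Monte Carlo sketch (my own illustration, not from either paper): draw both samples from one and the same discrete distribution, so that the null hypothesis F = G is true, and check how often kstest2 rejects at the 0.05 level; a conservative test should reject at most 5% of the time.

rng(0);                           % fix the seed for reproducibility
trials = 10000; rejections = 0;
for i = 1:trials
    x = poissrnd(100, 90, 1);     % both samples come from the same
    y = poissrnd(100, 60, 1);     % discrete (Poisson) distribution
    rejections = rejections + kstest2(x, y, 0.05, 'smaller');
end
rejections / trials               % should come out at or below 0.05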
But even with this assurance, one might wonder why the 0.05 significance level (equivalently, 95% confidence) is the right one to choose, or even whether the p-value approach to statistics is the right one overall.
On the 0.95 value, I was reading a paper by Simon that tried to quantify what regular people use as their standard of proof, e.g. in jury trials. She concluded that the standard of reasonable doubt was between 0.70 and 0.74, so maybe 0.95 is not the most appropriate choice. What do you think?
On the null hypothesis approach to statistics, I’ve recently been learning more about Bayesian data analysis methods, and am starting to feel that they might be a better way to go, but I am still not sure. What is your take?
That said, classical statistics does have really neat results, e.g. the Dvoretzky–Kiefer–Wolfowitz (DKW) inequality, which somehow doesn’t seem to be used as much as it could be. Are there any standard machine learning applications that you are aware of?
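As a reminder of what DKW buys you: with probability at least 1 - 2exp(-2n eps^2), the empirical CDF of n samples lies uniformly within eps of the true CDF, which gives a distribution-free confidence band. A minimal sketch of such a band for the 2010 data:

alpha = 0.05;
n = numel(d_2010);
eps_n = sqrt(log(2/alpha) / (2*n));   % DKW half-width: 2*exp(-2*n*eps_n^2) = alpha
[f,x] = ecdf(d_2010);
stairs(x, f); hold on;
stairs(x, min(f + eps_n, 1), '--');   % upper band, clipped to [0,1]
stairs(x, max(f - eps_n, 0), '--');   % lower band
xlabel('number of references'); ylabel('empirical CDF with 95% DKW band');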