Archive for August, 2011


Reasonable Doubt

August 26, 2011

Just to build on my previous post, I thought I’d go ahead and actually do the scientometrics experiment I had described.  I had data on the number of references in MIT EECS doctoral theses from the year 2010.  I also collected the same data for the year 2004.

My first step was to exclude theses that had “chapter-style” references rather than all at the end as in “monograph-style”.  Luckily (at least from my viewpoint), this “chapter-style” trend is not at all popular in EECS: for the 2004 data, I excluded 4/66 theses and for the 2010 data, 6/98 theses.  By contrast, as noted by Nils T. Hagen in the paper “Deconstructing doctoral dissertations: how many papers does it take to make a PhD?”:

The traditional single-authored monograph-style doctoral dissertation has declined into relative obscurity in the natural and biomedical sciences (cf. Lariviere et al. 2008), and been largely superseded by a modern PhD thesis consisting of a collection of multi-authored manuscripts and publications (e.g. Powell 2004).

After expurgating the data, I wanted to see whether there was a huge difference in the distributions.  In terms of first-order moments, in 2004 the average number of references was 99 whereas in 2010 the average number of references was 112.  This suggests that there has been an increase in the number of references, as I had conjectured.  To do a full distributional characterization, I plotted the empirical cumulative distribution functions.
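For anyone who wants to reproduce this step, here is a minimal Python sketch of an empirical CDF.  The reference counts are made-up placeholders, not the actual thesis data, and the function name `ecdf` is just my own choice for illustration.

```python
from bisect import bisect_right

def ecdf(sample):
    """Return a function F where F(x) is the fraction of sample values <= x."""
    ordered = sorted(sample)
    n = len(ordered)

    def F(x):
        # bisect_right counts how many sorted values are <= x
        return bisect_right(ordered, x) / n

    return F

# Hypothetical reference counts (placeholders, NOT the real thesis data).
refs_2004 = [40, 75, 90, 110, 180]
refs_2010 = [55, 95, 120, 150, 230]

F_2004 = ecdf(refs_2004)
F_2010 = ecdf(refs_2010)

# Compare the two empirical CDFs at a few reference counts.
for x in (50, 100, 150, 200):
    print(x, F_2004(x), F_2010(x))
```

If the 2010 curve sits below the 2004 curve at most points, then a smaller fraction of 2010 theses fall under any given reference count, which is what "more references in 2010" looks like on such a plot.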

Again, it looks like there are generally more references in 2010 than in 2004.  But is there a way to quantify it?  In their paper, “Critical Values for the One-Sided Two-Sample Kolmogorov-Smirnov Statistic,” Mitchell H. Gail and Sylvan B. Green say:

The Kolmogorov-Smirnov one-sided two-sample statistic is used to test the null hypothesis F = G against the alternate F > G where F and G are distribution functions.  If the random variables X and Y correspond to F and G, respectively, then the one-sided alternative is that Y is stochastically greater than X.  For example, one is often interested in the one-sided alternative that survival times (Y) with a new medical treatment are longer than survival times (X) with conventional therapy.  The two-sided alternative is of less interest in this case.

This seems perfect, so I ran the one-sided two-sample Kolmogorov-Smirnov test on the expurgated data, using MATLAB.  In particular, I used the command:

 [h,p] = kstest2(d_2010,d_2004,0.05,'smaller')  

I got the result h = 1 and p = 0.0380.  Since p < 0.05, the test rejects the null hypothesis that the two distributions are equal in favor of the one-sided alternative: theses from 2010 tend to have more references than theses from 2004.
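To make the test less of a black box, here is a rough Python sketch of the one-sided two-sample statistic, together with the textbook large-sample approximation for its p-value.  This is not MATLAB's implementation: the function names are mine, the data are placeholders, and `kstest2` may apply small-sample adjustments, so the p-values need not match exactly.

```python
from bisect import bisect_right
from math import exp

def ks_one_sided(x, y):
    """One-sided two-sample KS statistic D = max_t [F_x(t) - F_y(t)].

    A large D is evidence that y tends to be larger than x,
    i.e. the empirical CDF of y lies below that of x.
    """
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    d = 0.0  # the difference is 0 below every observation
    for t in xs + ys:  # both CDFs only change at observed values
        diff = bisect_right(xs, t) / n - bisect_right(ys, t) / m
        d = max(d, diff)
    return d

def asymptotic_p_value(d, n, m):
    """Large-sample approximation P(D >= d) ~ exp(-2 * n*m/(n+m) * d^2)."""
    return min(1.0, exp(-2.0 * n * m / (n + m) * d * d))

# Hypothetical reference counts (placeholders, NOT the real thesis data).
refs_2004 = [40, 75, 90, 110, 180]
refs_2010 = [55, 95, 120, 150, 230]

d = ks_one_sided(refs_2004, refs_2010)
p = asymptotic_p_value(d, len(refs_2004), len(refs_2010))
print(d, p)
```

With samples this small the asymptotic formula is only a crude guide; it is meant to show the shape of the calculation rather than to replace exact critical values such as those tabulated by Gail and Green.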

You might be worried that the Kolmogorov-Smirnov test is designed for continuous-valued data rather than discrete-valued data as here, but don’t be alarmed.  As noted by W. J. Conover in the paper “A Kolmogorov Goodness-of-Fit Test for Discontinuous Distributions”:

Studies of the Kolmogorov test with discontinuous distributions appear to be quite limited.  The Kolmogorov test is known to be conservative if F(x) is discrete.

and by G. E. Noether in the paper “Note on the Kolmogorov Statistic in the Discrete Case”:

A simple demonstration of the conservative character of the Kolmogorov test in the case of discrete distributions is given.

But even with this assurance, one might wonder why the 0.05 significance level (equivalently, 95% confidence) is the right one to choose, or even whether the p-value approach to statistics is the right one overall.

On the 0.95 value, I was reading a paper by Simon that tried to quantify what regular people use as their standard of proof, e.g. in jury trials.  She concluded that the standard of reasonable doubt was between 0.70 and 0.74, so maybe 0.95 is not the most appropriate choice.  What do you think?

On the null hypothesis approach to statistics, I’ve recently been learning more about Bayesian data analysis methods, and am starting to feel that they might be a better way to go, but I am still not sure.  What is your take?

Notwithstanding, standard statistics do have really neat results, e.g. the Dvoretzky–Kiefer–Wolfowitz inequality, which somehow doesn’t seem to be used as much as it could be.  Are there any standard machine learning applications that you are aware of?
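As a small illustration of how the Dvoretzky–Kiefer–Wolfowitz inequality could be used here, the bound P(sup |F_n − F| > ε) ≤ 2 exp(−2nε²) gives a distribution-free confidence band around an empirical CDF.  The Python sketch below computes the half-width ε for a given sample size; the function names are mine, and the sample sizes 62 and 92 are simply the thesis counts left after the exclusions described above (66 − 4 and 98 − 6).

```python
from math import log, sqrt

def dkw_epsilon(n, alpha=0.05):
    """Half-width of a (1 - alpha) DKW confidence band for n i.i.d. samples.

    Solves 2 * exp(-2 * n * eps^2) = alpha for eps.
    """
    return sqrt(log(2.0 / alpha) / (2.0 * n))

def dkw_band(ecdf_value, n, alpha=0.05):
    """Clip the band [F_n(x) - eps, F_n(x) + eps] to the interval [0, 1]."""
    eps = dkw_epsilon(n, alpha)
    return max(0.0, ecdf_value - eps), min(1.0, ecdf_value + eps)

# Sample sizes after excluding chapter-style theses.
for year, n in ((2004, 62), (2010, 92)):
    print(year, n, dkw_epsilon(n))
```

At these sample sizes the bands are fairly wide, which is one way to see why a formal test is useful before reading too much into two empirical curves.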


Da Bears

August 12, 2011

That was some good stuff in your previous post.  Yeah, it is amazing how quickly scientific papers are generated in the worldwide system.  I believe this causes a certain sense of information overload that many people feel.  Moreover, not only is there an increasing number of papers, but there has also been an emergence of putatively new fields of study, like synthetic biology, connectomics, and service science.  Of course it remains to be seen whether these fields remain viable or whether they collapse.

I was reading this paper “Google Effects on Memory: Cognitive Consequences of Having Information at Our Fingertips” by Sparrow, Liu, and Wegner that recently appeared in Science.  As the authors say, the internet has caused a huge technological shift in how information can be used:

In a development that would have seemed extraordinary just over a decade ago, many of us have constant access to information. If we need to find out the score of a ball game, learn how to perform a complicated statistical test, or simply remember the name of the actress in the classic movie we are viewing, we need only turn to our laptops, tablets, or smartphones and we can find the answers immediately. It has become so commonplace to look up the answer to any question the moment it occurs that it can feel like going through withdrawal when we can’t find out something immediately.

Moreover, they go on to describe experiments that demonstrate how technology has changed the nature of human cognition itself.  They essentially demonstrate that “our internal encoding is increased for where the information is to be found rather than for the information itself.”  They discuss their results as follows:

These results suggest that processes of human memory are adapting to the advent of new computing and communication technology. Just as we learn through transactive memory who knows what in our families and offices, we are learning what the computer “knows” and when we should attend to where we have stored information in our computer-based memories. We are becoming symbiotic with our computer tools, growing into interconnected systems that remember less by knowing information than by knowing where the information can be found. This gives us the advantage of access to a vast range of information, although the disadvantages of being constantly “wired” are still being debated.

Maybe the Google Scholar generation is fine with the growing volume of information due to these cognitive changes.  From introspection, I certainly feel that I don’t know all that much in the scientific literature, but rather I either know where to find it or feel confident that I could search for it if I needed to.  Recently I’ve slowly been learning a more empirical approach to life, so perhaps I should offer more evidence than simply introspection.

On the plane ride back from St. Petersburg, I was telling someone that I probably have a much more “referency” writing style than others.  (Although one might cynically feel that references are a way to show off erudition, in my approach to writing I include references because I know things “by reference” and also to give proper attribution.)  To test this hypothesis using scientometrics, I went through all 98 doctoral theses in EECS at MIT from the year 2010.  Although I am definitely in the top three, with 371 references, I do not hold the top spot.  That distinction is held by Umit Demirbas with his thesis, “Low-cost, highly efficient, and tunable ultrafast laser technology based on directly diode-pumped Cr:Colquiriites,” which has 416 references.  Your thesis comes in at #5 with 230 references, so it seems that you too have something of a referency writing style.  As a point of comparison on the other side, Mike Rinehart’s thesis, “The value of information in shortest path optimization,” has 19 references.

To make an actual generational argument, though, perhaps I should get the numbers from another year, like 2004.  Anyone up for crowdsourcing the data collection?  Also, what would be an appropriate statistical test for me to look up to make such an argument?

Shifting gears to the start of the football season, in a certain sense I am more intrigued by UConn than by Syracuse itself.  Of course this is primarily due to Paul Pasqualoni and George DeLeone, a chief architect of the freeze option offense.  People often used to say that the Syracuse football playbook was too big and confusing; certainly larger than at other schools.  Coming back to the question of too much or too little, I wonder if there is a way to make an argument about the pluses and minuses of strategic complexity in a competition like football with bounded agents; perhaps following the lines of Daskalakis?

Anyway, let me leave it there and not mention anything about cognitive history or the difficulty of ranking multivariates, or my newfound fear of non-human hominids.



August 8, 2011

Welcome home, Señor Rasputin the mad monk. How was St. Petersburg?

In the Klosterman article, I liked how Ato Boldon said, “sprinters believe that — someday — somebody will run the 100 meters and the clock will read 0.00.”  Related to scientific progress, modeling records, and crowding, I recently came across two passages.  First, from the Financial Times:

Da Vinci was able to achieve so much, so broadly, because so little was known. It was possible to make leaps forward in scientific understanding armed with little more than a keen eye and a vivid imagination. Those times are long gone. Approximately 3,000 scientific articles are published per day – roughly one every 10 seconds of a working day. We can now expect that these papers will, each year, cite around five million previous publications. And the rate of production of scientific papers is quadrupling every generation. The percentage of human knowledge that one scientist can absorb is rapidly heading towards zero.

Second, from the IEEE Spectrum:

Given any prospective problem, a search may reveal a plethora of previous work, but much of it will be hard to retrieve. On the other hand, if there is little or no previous work, maybe there’s a reason no one is interested in this problem. You need something in between. Moreover, even in defining the problem you need to see a way in, the germ of some solution, and a possible escape path to a lesser result, like the runaway truck ramps on steep downhill highways.

Timing is critical. If a good problem area is opened up, everyone rushes in, and soon there are diminishing returns. On unimportant problems, this same herd behavior leads to a self-approving circle of papers on a subject of little practical significance. Real progress usually comes from a succession of incremental and progressive results, as opposed to those that feature only variations on a problem’s theme.

You asked if there is some distribution that would model scientific progress.  I don’t think that the modeling would be much different, even including crowding, from other types of models in the theory of records described in the book by Barry Arnold et al. you linked to, with one qualification.  How do you quantify scientific progress?  It is not as simple to measure as sprint times or floods.  (The Eurekometrics plots come with the qualification that they are of “areas where discovery – not simply scientific output – is well-defined and may be easily quantified.”)

By the way, Barry Arnold was Bill Hanley’s advisor.