Archive for May, 2010

Simultaneity

May 27, 2010

I know you were being a little bit facetious in elements of your previous post, but when one writes a paper on human history, like this one on reconstructing Indian population history or the one on Neanderthals that I had mentioned, it seems that societal reaction cannot be neglected.  Indeed the reason for interest in human history is exactly because it tells us where we came from; the most compelling stories are often the ones of our own past.  As such, I think authors may feel an even higher burden of statistical certainty than they otherwise might, though of course one should always maintain high standards of proof for scientific statements.

Incidentally, did I ever tell you that Nick Patterson from the Broad Institute, who closely collaborates with the aforementioned David Reich, reminds me in many ways of James Burke?

When I was at Janelia Farm, I went to a lunchtime presentation about statistical methods for experiments involving multiple comparisons.  It was interesting in its own right, discussing techniques that I myself have used for motif detection, and something that our father has some recent interest in.  The other interesting thing, however, was the strong emphasis on ruling out null hypotheses.  When classical statistics are used for scientific investigations, the use of controls and the notion of null hypotheses seem to be a huge deal.  On the other hand, in traditional applications of statistical signal processing there seem not to be any privileged hypotheses.  Since you are more closely involved in the statistics and statistical signal processing communities than I am, perhaps you have more insight into this philosophical distinction.  I know you’ve done some work on detection in settings where false alarms are much more troubling than missed detections (and not just referees that swallow whistles).

Anyway, let me come back to the Neanderthal paper, as you had asked me to.   Rather than recapitulate the main results of the paper, which have been well-described by many other commentators, let me focus on a couple of interesting little pieces.  The first is the phenomenon of unidirectional gene flow and the concept of gene surfing.  As Green et al. say, “we detect gene flow from Neandertals into modern humans but no reciprocal gene flow from modern humans into Neandertals.”  Now this seems very puzzling: how can gene flow not be bidirectional?  They go on to explain this by essentially invoking facts about population dynamics: “it has been shown that when a colonizing population (such as anatomically modern humans) encounters a resident population (such as Neandertals), even a small number of breeding events along the wave front of expansion into new territory can result in substantial introduction of genes into the colonizing population as introduced alleles can ‘surf’ to high frequency as the population expands. As a consequence, detectable gene flow is predicted to almost always be from the resident population into the colonizing population, even if gene flow also occurred in the other direction.”  So it is all about detectability.  Moreover, “another prediction of such a surfing model is that even a very small number of events of interbreeding can result in appreciable allele frequencies of Neandertal alleles in the present-day populations.”  So rare events can have huge impacts on what is observable.

Another piece that you might find interesting is found in the online supplementary material, and in particular part 14 on Date of population divergence between Neandertals and modern humans.  The goal is to put an absolute time scale on evolutionary events such as splits in the tree from the genetic data itself.  I won’t go into the details, but surprisingly, parameters such as mutation rate and generation time are not required because they cancel out; only one calibration date is needed.  (Obviously the disembodiment of life implied by artificial insemination and the dissociation from physical time and space that it allows are not considered).
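To see how such a cancellation can happen, here is a toy version under a strict molecular clock. This is my own simplification, not the estimator used in the supplement, and every symbol is an assumption introduced for illustration: \mu is the mutation rate per generation, g the generation time in years, d_{NH} and d_{HC} the sequence divergences for Neandertal-human and human-chimpanzee, and t_{NH} and t_{HC} the corresponding split times in years.

```latex
d_{NH} = 2\,\frac{\mu}{g}\,t_{NH}, \qquad
d_{HC} = 2\,\frac{\mu}{g}\,t_{HC}
\quad\Longrightarrow\quad
\frac{d_{NH}}{d_{HC}} = \frac{t_{NH}}{t_{HC}}
\quad\Longrightarrow\quad
t_{NH} = \frac{d_{NH}}{d_{HC}}\,t_{HC}
```

In this sketch both \mu and g drop out of the ratio, and the single calibration date t_{HC} supplies the absolute time scale.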

In closing this post, let me just say that I find it very hard to imagine what it would be like with several hominid species walking the earth simultaneously.  Not just the Neanderthals, but also the hobbits Homo floresiensis (if they are actually a separate species), and others.  [As you had pointed out, interpreting archeological evidence is difficult, so it is unclear whether H. floresiensis is distinct.  Brain size and scaling laws are the basis for several of the arguments both for and against.]  Would there be positive social interaction or would there be deep mistrust?  Would interbreeding result from acts of war?  Perhaps each would see the other as a curiosity.  Hard to know.

Vānaras

May 9, 2010

Interesting, Señor Jaime Yzaga.  Please do come back to this.  It’s not every day that one knows one of the authors of a study generating such popular buzz.

One part of the New York Times article that I thought was interesting was this: “archaeologists questioned some of the interpretations put forward by Dr. Paabo and his chief colleagues, Richard E. Green of the Leipzig institute, and David Reich of Harvard Medical School. Geneticists have been making increasingly valuable contributions to human prehistory, but their work depends heavily on complex mathematical statistics that make their arguments hard to follow. And the statistical insights, however informative, do not have the solidity of an archaeological fact.”  I don’t think archaeology is any more factual than other sciences, and all involve interpretation.  Understanding a civilization from few artifacts involves interpretation.  The less data you have, the more interpretation or ‘inductive bias’ you need. 

In my last post, when I said “I’m not typing up all those numbers,” I was being facetious, but my statement also highlights the difference between today and yesterday.  As the video I linked to in the post said, “Trillions of digital devices, connected through the Internet, are producing an ocean of data.”  Getting data is much easier now than before (although still not easy).  If you have tons of data, you can do remarkable things through simple methods.  Also the inverse: you can’t do much if you don’t gots much data.

And now, some Neanderthal interpretation from basically no data: a conjecture just for fun.

According to tradition, the alliance between Rāma and the strong, spear and stone-wielding, cave-dwelling vānaras was forged more than 869,000 years ago.  This age is of the same order of magnitude as when humans and the strong, spear and stone-wielding, cave-dwelling Neanderthals contemporaneously walked the earth.  Neanderthal range didn’t extend into the subcontinent, but just reached as far as Afghanistan.  And you know what – as Ram Sharan Sharma writes, “but we do not find even a modest settlement at Ayodhya until 500 BC, as is true of the whole mid-Ganga plain. Because of this difficulty some scholars locate the original Ayodhya in Afghanistan …”  In fact, the entire Rāmāyaṇa probably took place northwest of the subcontinent.  It isn’t hard to imagine the vānara senā being an army of Neanderthals, is it? 

Lav, do you think this conjecture will attract a rant from a hindutvadi commenter?  Also, did you know that if you can find the right journal, even a paper like this is publishable? 

Neanderthals

May 7, 2010

I’m at Janelia Farm at the moment and one of the topics of discussion at lunchtime was the new paper on the Neanderthal genome.  Someone else also pointed out an article in The New York Times that discusses some of the findings.  One of the main contributors to the statistical analysis used to argue that Neanderthals interbred with modern humans was David Reich, who I had spent some time with in the fall.  The sequencing was carried out at the Max Planck Institute for Evolutionary Anthropology, which was featured in The Human Spark documentary.

Let me leave it at that for now, and come back to this after I have actually read the paper.  Interestingly, though, let me note that one of the things I had asked Reich was whether the Mahalanobis dataset that you had described could be used together with Indian genomic datasets to say something interesting.

Data Yes. Wisdom No.

May 5, 2010

Hymie, this post is brought to you by the decade of smart. 

Rating feature importance is tricky business.  With the sparrows, I did a least squares linear fit of the logarithm of each of the length and width features against the logarithm of weight (albeit a bit suspect statistically), finding the coefficient of the linear term to be 0.23 for total length, 0.23 for alar extent, 0.21 for length of beak and head, 0.29 for length of humerus, 0.26 for length of femur, 0.29 for length of tibiotarsus, 0.21 for width of skull, and an outlier 0.43 for length of keel of sternum.  For a length scaling with weight, the isometric exponent should be 1/3, so all of these measurements have negative allometry except for one.
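For anyone who wants to try this kind of fit, here is a minimal sketch in plain Python, assuming each length is regressed on weight in log-log space (which is what makes 1/3 the isometric slope). The function name `allometric_exponent` and the numbers are made up for illustration; they are not the Bumpus measurements.

```python
import math


def allometric_exponent(weights, lengths):
    """Least-squares slope b in log(length) = a + b * log(weight)."""
    xs = [math.log(w) for w in weights]
    ys = [math.log(v) for v in lengths]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Ordinary least squares slope: covariance over variance.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic data: lengths generated to scale exactly isometrically with weight.
weights = [20.0, 22.0, 24.0, 26.0, 28.0, 30.0]
isometric_lengths = [5.0 * w ** (1.0 / 3.0) for w in weights]
# Synthetic data with a smaller exponent, i.e. negative allometry.
stunted_lengths = [5.0 * w ** 0.23 for w in weights]

b_iso = allometric_exponent(weights, isometric_lengths)
b_neg = allometric_exponent(weights, stunted_lengths)
print(b_iso, b_neg)
```

A fitted slope below 1/3 means the length grows more slowly than isometry would predict as weight increases.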

You asked about principled methods for commensurating features.  I take it that you’re talking about how to do simple scalings of feature dimensions.  In fact, standard classification trees with axis-aligned splits and random forests built from them are not affected by scaling dimensions.  However, other pattern recognition methods most surely are.  I may be wrong, but it is my impression that there are few existing principles.  I invite any reader who knows otherwise to share that with Hymie and me.  A related thing that has only recently started receiving attention is automatically learning kernels.
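The scale invariance of axis-aligned splits is easy to check on a toy example. Below is a minimal sketch of a single decision stump, not a full tree or forest; the helper `best_split`, the Gini criterion, and the data are all assumptions chosen for illustration. Rescaling the feature (say, centimeters to millimeters) moves the threshold but leaves the partition of the samples unchanged.

```python
def gini(group):
    """Gini impurity of a list of 0/1 labels."""
    if not group:
        return 0.0
    p = sum(group) / len(group)
    return 2.0 * p * (1.0 - p)


def best_split(values, labels):
    """Return (threshold, indices sent left) for the best axis-aligned split."""
    distinct = sorted(set(values))
    best = None
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2.0
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[0]:
            best = (score, t)
    threshold = best[1]
    return threshold, frozenset(i for i, v in enumerate(values) if v <= threshold)


# Made-up feature values and class labels.
feature_cm = [1.2, 1.9, 2.4, 3.1, 3.8, 4.5]
labels = [0, 0, 0, 1, 1, 1]
feature_mm = [10.0 * v for v in feature_cm]  # same feature, different units

t_cm, part_cm = best_split(feature_cm, labels)
t_mm, part_mm = best_split(feature_mm, labels)
print(t_cm, t_mm, part_cm == part_mm)
```

The same argument applies to any strictly increasing transformation of a single feature, since a threshold split only depends on the ordering of the values; distance-based methods such as nearest neighbors do not share this property.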

Anyways, I divided the length features by weight raised to the exponent that was fit and obtained random forest out-of-bag feature importance values as before.  Total length seemed to again be confounding matters, so I took it out.  Here are the importances. 

The Bumpus sparrow data hasn’t provided me with any wisdom; it’s just too tricky.  There’s a large dataset of body part measurements of beings that possess the human spark that I think would be interesting to look at: the Anthropometric Survey of the United Provinces (1941).  Unlike sparrow survival, however, it is not clear what the response variable to be examined is.  Maybe some unsupervised analysis is the way to go, but I’m not typing up all those numbers.

Also, I’m done with the birds.  Nevermore.