Archive for June, 2010

Yet

June 27, 2010

Señor Redcardo Clark Schweinsteiger, I was also saddened to hear of the passing of Manute Bol.  The poster in our elementary school gym that still resonates with me is the one that said “Don’t say I can’t, say I can’t yet.”

Here are some two-dimensional embeddings (PCA left, Isomap right) of FIFA World Cup squad heights: first taking the features to be the counts of the discrete-valued heights (there are 36 different heights in the rosters, so these features are essentially probability mass functions), and second taking the features to be cumulative distribution functions.
With the cdf features, in both embeddings the first component (the dimension shown left to right) is essentially the average height.  I have no interpretation for the embedded pdf features.  The Isomap embedding with the cdf features shows interesting structure, although I don’t see anything related to team performance.  Working with the cdf features seems to make better sense than working with the pdf features.
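For concreteness, here is a minimal sketch of how those pmf and cdf features and the two embeddings could be computed; this is not the code behind the figures, and the function name, the scikit-learn calls, and the Isomap neighborhood size are all just assumptions for illustration.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap

def squad_embeddings(squads, n_neighbors=5):
    """squads: dict mapping team name -> list of player heights in cm."""
    # Feature grid: every distinct height occurring in any roster
    # (36 values in the actual 2010 squad lists).
    grid = sorted({h for heights in squads.values() for h in heights})
    # pmf features: normalized counts of each grid height within a squad
    X_pmf = np.array([[hs.count(g) / len(hs) for g in grid]
                      for hs in squads.values()])
    # cdf features: running sums of the pmf features along the grid
    X_cdf = np.cumsum(X_pmf, axis=1)
    pca = PCA(n_components=2).fit_transform(X_cdf)
    iso = Isomap(n_neighbors=n_neighbors, n_components=2).fit_transform(X_cdf)
    return pca, iso

# Usage: pass all 32 rosters scraped from the FIFA squad lists, e.g.
# squad_embeddings({"USA": [187, 183, 184, ...], "ESP": [...], ...})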

The origins of statistics in biometry, eugenics, and the like certainly make for an interesting history.  I recently attended a seminar given by Sastry Pantula in his role as president of the American Statistical Association (ASA), in which he talked about the association’s growth, impact, visibility, and education initiatives.  One interesting point that came up was that professional statisticians are discouraging the AP Statistics course offered in high schools.  Also, apparently, Pantula shows an IBM commercial whenever he gives his spiel about the ASA, a commercial like this one.

Statistics has really come a long way since its origins and some of it is very mathematical—some would say too mathematical.  Regarding statistics, Jerome Friedman wrote more than a decade ago, “If we are to compete with other data related fields in the academic (and commercial) marketplace, some of our basic paradigms will have to be modified. We may have to moderate our romance with mathematics. Mathematics (like computing) is a tool, a very powerful one to be sure, but not the only one that can be used to validate statistical methodology. Mathematics is not equivalent to theory, nor vice versa. Theories are intended to create understanding and mathematics, although quite valuable, is not the only way to do this. For example, the germ theory of disease (in and of itself) has little mathematical content, but it leads to considerable understanding of much medical phenomena. We will have to recognize that empirical validation, although necessarily limited (as is mathematics), does constitute a form of validation.”  Interesting stuff that I agree with.  It seems to me that the new field going by the name analytics is taking over the commercial marketplace (if it hasn’t already).  For example, Steve LaValle recently said that “in top-performing organizations, analytics has replaced intuition as the best way to answer questions.”

There is another related romance that I think has blossomed in fields such as information theory, theoretical computer science, signal processing, and statistics, which is described here by several prominent theoretical computer scientists: “1. Assignment of little weight to ‘conceptual’ considerations, while assigning the dominant weight to technical considerations. 2. The view that technical simplicity is a drawback, and the failure to realize that simple observations may represent an important mind-switch that can pave the way to significant progress.”

Is this sort of thing an inevitable part of the evolution of fields of study?  For any field could you say that it hasn’t happened, or only that it hasn’t happened yet?

Anthropometry

June 24, 2010

I was saddened to hear of the recent death of Manute Bol.  As you know, he was a 7’7″ member of the Dinka tribe who played for several years in the NBA.  He devoted nearly all of his resources to trying to help his native Sudan (perhaps in a misdirected way, but that is a discussion for another forum) and died essentially in poverty.  As you may recall, there was a life-size poster of Bol in the gym at our elementary school.  Although he was among the tallest professional basketball players of all time, he was not close to being the tallest person ever.

I think we both saw the episode of The Amazing Race that featured He Pingping and Bao Xishun, at the time the world’s shortest and tallest men.  He was 2’5″ whereas Bao is 7’9″, a difference of 5’4″.  (He also recently died, whereas Bao lost the title to Sultan Kösen.)

The range of human dimensions is just amazing, isn’t it?  And that isn’t even considering the possibility of other hominid species like the hobbits I had mentioned previously.  As noted on p. 358 of A Short History of Nearly Everything by Bill Bryson, Linnaeus “made room for mythical beasts and ‘monstrous humans’ whose descriptions he gullibly accepted from seamen and other imaginative travelers.  Among these were a wild man, Homo ferus, who walked on all fours and had not yet mastered the art of speech, and Homo caudatus, ‘man with a tail.'”  Bryson goes on to say on p. 368 that the world “is actually enormous—enormous enough to be full of surprises.  The okapi, the nearest living relative of the giraffe, is now known to exist in substantial numbers in the rain forests of Zaire—the total population is estimated at perhaps thirty thousand—yet its existence wasn’t even suspected until the twentieth century,” and he goes on to describe the large flightless bird from New Zealand called the takahe and the Tibetan breed of horse called the Riwoche.  I do wonder whether there are other hominid species around that we just don’t know about.

Coming back to our own species, there are whole books on the topic of human body size variation, and on the advantages of being short and the advantages of being tall.  The principle of allometric scaling, which we previously discussed here, is used in many of the arguments.  The most easily accessible examples are in sports.  For example, Wikipedia argues that different positions in soccer have different optimal heights.  Since the World Cup is going on these days, there is an interesting opportunity to see whether the height distribution of a team is predictive of performance.  Squad lists for each of the 32 teams, including the height of each player, are provided on the FIFA website.  For example, the 23 players on the American side have the following heights (in cm): [187, 183, 184, 185, 192, 168, 170, 185, 178, 173, 178, 175, 178, 185, 180, 168, 185, 193, 183, 175, 193, 175, 192].  Even if the height distribution turns out not to be predictive of much, I think just mapping the height distributions would be very interesting.  I know you have some interest in information geometry, so this would be a kind of “demographic information geometry.”  It might also play well into your interests in dimensionality reduction, since the map would hopefully not live in a high-dimensional space.  Also, the height distribution should probably be treated as a multiset rather than as a sequence (or perhaps a partially ordered set, using the players’ positions), as in the little sketch below.
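As a toy illustration of the multiset point, here is how the American roster above could be reduced to a multiset and an empirical cdf; a hedged sketch in Python, using nothing beyond the heights already listed.

from collections import Counter

usa = [187, 183, 184, 185, 192, 168, 170, 185, 178, 173, 178,
       175, 178, 185, 180, 168, 185, 193, 183, 175, 193, 175, 192]

multiset = Counter(usa)            # player order is discarded, multiplicities kept
mean_height = sum(usa) / len(usa)  # about 181 cm for this squad

# Empirical cdf evaluated at each distinct height in the squad
distinct = sorted(multiset)
cdf = {h: sum(c for k, c in multiset.items() if k <= h) / len(usa)
       for h in distinct}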

Readers of this blog might wonder why such fascination with anthropometry.  Besides the fact that it is just generally interesting, that it may have applications in medicine, that it may be useful in determining genotype-phenotype relationships from genome-wide association studies, and that it is useful in an engineering sense for the clothing industry, it is historically a topic that led to foundational developments in statistics (for estimation rather than detection, as you detailed previously).  You had mentioned the work of Mahalanobis, but it seems that Galton had been inspired even earlier to come up with the concept of correlation through his anthropometric studies.

The traditional approach to anthropometric surveys has been to make some number of physical measurements on a large number of people; for example, 240 body-part measurements were taken on United States Army personnel.  A more modern approach is to use three-dimensional whole-body scanners; this was used in two large-scale surveys in Britain and the US, both sponsored by industry.  There seem to be all kinds of sampling questions, though.  The first is the sampling of subjects, a classic topic in statistics.  The second, more interesting one, however, is the sampling in the scan itself, a signal processing topic.  Is there a Nyquist-like or FRI-like sampling theorem for the signal class of human bodies?  How many measurements are needed to be able to reconstruct a human’s physical dimensions?  Somewhat more modest than a full sampling theorem: due to scaling, is it possible to recover certain body measurement functionals from a set of other body measurement functionals?  There have actually been several weird studies along these lines, using simple correlation statistics that even Galton would recognize.
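To make the “recover one functional from others” question concrete, here is a sketch in the spirit of those simple-correlation studies: fit a linear model predicting one body measurement from two others.  The numbers below are synthetic placeholders, not survey data, and the rough proportionality constants are made up purely for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
height = rng.normal(175, 8, size=500)                  # cm, synthetic subjects
shoulder = 0.25 * height + rng.normal(0, 2, size=500)  # cm, loosely tied to height
arm_span = 1.02 * height + rng.normal(0, 3, size=500)  # cm, roughly tracks height

X = np.column_stack([height, shoulder])
model = LinearRegression().fit(X, arm_span)
r2 = model.score(X, arm_span)  # fraction of arm-span variance the other two explain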

One use of all this anthropometric data is to come up with standardized clothing sizes, a problem of categorization and quantization.  In men’s clothing, the names of the sizes often correspond to some physical measurement, such as the lengths of the waistband and inseam for pants.  In women’s clothing, however, the names of the sizes are on an arbitrary numerical scale, which leads to all kinds of problems like vanity sizing.  In either case, however, one or two numbers cannot be enough to describe the cut.  Perhaps a sampling and quantization theory of anthropometry will allow us all to have tailor-made clothes.
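Reading “categorization and quantization” literally, one could vector-quantize body measurements and call the codewords sizes.  Below is a hedged sketch that clusters synthetic (chest, waist, inseam) triples with k-means and treats the centroids as standard sizes; the measurement model and the choice of six sizes are arbitrary assumptions.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
chest = rng.normal(100, 8, size=1000)               # cm, synthetic population
waist = 0.85 * chest + rng.normal(0, 5, size=1000)  # cm, loosely tied to chest
inseam = rng.normal(80, 5, size=1000)               # cm
bodies = np.column_stack([chest, waist, inseam])

sizing = KMeans(n_clusters=6, n_init=10, random_state=0).fit(bodies)
standard_sizes = sizing.cluster_centers_  # one (chest, waist, inseam) per size
assignments = sizing.labels_              # which size each person would wear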

As a closing thought, let me congratulate someone who recently had a son that was 7 lbs. even and 1’9″ tall at birth.  Apparently an average North American newborn baby is 7.5 lbs. and 1’8″ tall, with 95 percent of full-term newborns being 1’6″ to 1’10″ tall and weighing 5.5 to 10 lbs., so right on.

ROC

June 9, 2010

Señor Jaime Oncins, there is an interesting history to that philosophical distinction related to alternative hypotheses, which I first became aware of when reading this paper by Clayton Scott and Rob Nowak.

As Lehmann describes, there was apparently a huge ongoing fracas between Fisher, whose view included p-values and all the stuff that science researchers worry about, and Neyman, whose view included type I errors (false alarms) and type II errors (missed detections) and all the stuff that radar engineers worry about.  (Apparently both of them hated Bayesian hypothesis testing.)  Here’s a passage from Lehmann: “Neyman did not believe in the need for a special inductive logic but felt that the usual processes of deductive thinking should suffice.  More specifically, he had no use for Fisher’s idea of likelihood.  In his discussion of Fisher’s 1935 paper (Neyman, 1935, p. 74, 75) he expressed the thought that it should be possible ‘to construct a theory of mathematical statistics … based solely upon the theory of probability,’ and went on to suggest that the basis for such a theory can be provided by ‘the conception of frequency of errors in judgment.’  This was the approach that he and Pearson had earlier described as ‘inductive behavior’; in the case of hypothesis testing, the behavior consisted of either rejecting the hypothesis or (provisionally) accepting it.”

I personally feel much more comfortable with the Neyman view and associated objects such as receiver operating characteristics (ROCs), and get an uneasiness in my stomach when t-tests and p-values are bandied about.

As you intimated right after mentioning the philosophical distinction, some work that I did recently with Ryan Prenger, Tracy Lemmond, Barry Chen, and Bill Hanley considers false alarms, missed detections, and ROCs.  Our paper entitled Class-Specific Error Bounds for Ensemble Classifiers, which will be presented in July at the KDD conference, develops loose but highly predictive generalization error bounds on false alarms and missed detections at all operating points for ensemble classifiers such as random forests.  These Prenger bounds provide guidelines for how to push the ROC up, reducing missed detections in the ultra-low missed-detection regime where missed detections are really costly, or to push the ROC out, reducing false alarms in the ultra-low false-alarm regime where false alarms are really costly.
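The operating-point view is easy to make concrete.  Here is a small sketch, not our paper’s experiments, that traces the ROC of a random forest by sweeping its score threshold and then picks the threshold maximizing the detection rate subject to a false-alarm budget; the synthetic dataset and the 1% budget are placeholders standing in for the ultra-low false-alarm regime.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
scores = forest.predict_proba(X_te)[:, 1]

# Sweep the threshold: fpr is the false-alarm rate, tpr the detection rate
fpr, tpr, thresholds = roc_curve(y_te, scores)

budget = 0.01                 # tolerate at most a 1% false-alarm rate
feasible = fpr <= budget
best = np.argmax(tpr[feasible])
operating_threshold = thresholds[feasible][best]
missed_detection_rate = 1.0 - tpr[feasible][best]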

There is some p-value stuff going on in our KDD paper, but it does not have a statistically significant effect on my stomach uneasiness.