A friend of the blog was recently asking both of us how to cluster time series (of possibly different lengths), and in response to that query I had looked at the paper “A novel hierarchical clustering algorithm for gene sequences” by Wei, Jiang, Wei, and Wang who are bioinformatics researchers from China and Canada. The basic idea is to generate feature vectors from the raw time series data, define a distance function on the feature space, and then use this distance measure to do (hierarchical) clustering. At the time, I also flipped through a survey article on clustering time series, “Clustering of time series data—a survey” by Liao who is an industrial engineer from Louisiana. As he says, the goal of clustering is to identify structure in an unlabeled data set by organizing data into homogeneous groups where the within-group-object similarity is minimized and the between-group-object dissimilarity is maximized and points out five broad approaches: partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods. There are of course applications in all kinds of fields, such as biology, finance, and of course social media analytics where one might want to cluster Twitter users according to the time series patterns of tweeting sentiment.
But any technique seems to require some notion of similarity to proceed. As Leslie Valiant says in his book, Probably Approximately Correct [p. 159]:
PAC learning as we have described it is a model of supervised learning. One of its strengths is that it is essentially assumption free. Attempts to formulate analogous theories for unsupervised learning have not been successful. In unsupervised learning the learner appears to have to make specific assumptions about what similarity means. If externally provided labels are not available, the learner has to decide which groups of objects are to be categorized as being of one kind, and which of another kind.
I hold the view that supervised learning is a powerful natural phenomenon, while unsupervised learning is not.
So maybe clustering is not a powerful natural phenomenon (but would Rand disagree?), but I’d like to do it anyway. As some say, clustering is an art rather than a science, but I like art, don’t you? In some sense the question boils down to developing notions of similarity that are appropriate. Though I must admit I do have some affinity for the notion of “natural kinds” that Bowker and Star sometimes talk about when discussing schemes for classifying various things into categories.
Let me consider a few examples of clustering to set the stage:
- When trying to understand the mapping between neural activity and behavior, is it important to cluster video time series recordings of behavior into a discrete set of “behavorial phenotypes” that can then be understood. This was done in a paper by Josh Vogelstein et al., summarized here. An essentially Euclidean notion of similarity was considered.
- When trying to understand the nature of the universe and specifically dark matter, a preprint by my old Edgerton-mate Robyn Sanderson et al., discusses the use of the Kullback-Leibler divergence for measuring things in a probabilistic sense, without having to assert the notion of similarity too much in the original domain.
- To take a completely different example, how might people in different cultures cluster colors into named categories? In fact this has been studied in a large-scale worldwide study, which has made the raw data available. How does frequency become a categorical named color, and which color is most similar to another?
Within their domains, these clusterings seem to be effective, but is there a general methodology? One idea that has been studied is to ask people what they think of the results of various formal clustering algorithms, a form of anthropocentric data analysis, as it were. Can this be put together algorithmically with information-theoretic ideas on sampling distortion functions due to Niesen and all?
Another idea I learned from Jennifer Dy, who I met in Mysore at the National Academy of Engineering‘s Indo-American Frontiers of Engineering Symposium last month, is to actually create several different possible clusterings and then let people decide. A very intriguing idea.
Finally, one might consider drawing on universal information theory and go from there. A central construct in universal channel coding is the maximum mutual information (MMI) decoder, which doesn’t require any statistical knowledge, but learns things as it goes along. Misra and Weissman modified that basic idea to do clustering rather than decoding, in a really neat result. Didn’t make it into Silicon Valley, as far as I can tell, but really neat. Applications to dark matter?
You are currently en route to Australia to, among other things, present our joint work on olfactory signal processing at the IEEE Statistical Signal Processing workshop. One paper on active odor cancellation, and the other on food steganography. Do let me know of any new tips or tricks you pick up down under: hopefully with some labels rather than forcing me to do unsupervised learning. Also, what would you cluster angelica seed oil with?