June 28, 2014

A friend of the blog was recently asking both of us how to cluster time series (of possibly different lengths), and in response to that query I had looked at the paper “A novel hierarchical clustering algorithm for gene sequences” by Wei, Jiang, Wei, and Wang, who are bioinformatics researchers from China and Canada.  The basic idea is to generate feature vectors from the raw time series data, define a distance function on the feature space, and then use this distance measure to do (hierarchical) clustering.  At the time, I also flipped through a survey article on clustering time series, “Clustering of time series data—a survey” by Liao, who is an industrial engineer from Louisiana.  As he says, the goal of clustering is to identify structure in an unlabeled data set by organizing data into homogeneous groups where the within-group-object dissimilarity is minimized and the between-group-object dissimilarity is maximized.  He points out five broad approaches: partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods.  There are applications in all kinds of fields, such as biology, finance, and of course social media analytics, where one might want to cluster Twitter users according to the time series patterns of tweeting sentiment.
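To make the feature-vector recipe concrete, here is a minimal sketch in Python.  The three features (mean, standard deviation, lag-1 autocorrelation), the Euclidean distance, the single-linkage merging rule, and the toy data are all my own illustrative assumptions; this is not the feature construction or the algorithm of Wei et al.  The point is only that series of different lengths land in the same fixed-dimensional space, where any hierarchical method can then be run.

```python
import math
from statistics import mean, pstdev


def features(series):
    """Map a series of any length to a fixed-length feature vector:
    (mean, standard deviation, lag-1 autocorrelation)."""
    m = mean(series)
    s = pstdev(series)
    if s == 0 or len(series) < 2:
        acf1 = 0.0
    else:
        num = sum((a - m) * (b - m) for a, b in zip(series, series[1:]))
        acf1 = num / (len(series) * s * s)
    return (m, s, acf1)


def distance(u, v):
    """Euclidean distance on the feature space (an assumed choice)."""
    return math.dist(u, v)


def single_linkage(points, k):
    """Agglomerative clustering: start from singletons and repeatedly merge
    the two clusters whose closest members are nearest, until k remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(distance(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Toy series of different lengths (made-up data, for illustration only).
series = [
    [0, 1, 0, 1, 0, 1, 0, 1],        # fast oscillation
    [1, 0, 1, 0, 1, 0],              # fast oscillation, shorter
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],  # upward trend
    [1, 2, 3, 4, 5, 6, 7],           # upward trend, shorter
]
clusters = single_linkage([features(s) for s in series], k=2)
```

On this toy data the two oscillating series end up in one cluster and the two trending series in the other.  The features are not rescaled here, so whichever feature has the largest numerical range dominates the distance; in practice one would probably want to standardize them first.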

But any technique seems to require some notion of similarity to proceed.  As Leslie Valiant says in his book, Probably Approximately Correct [p. 159]:

PAC learning as we have described it is a model of supervised learning.  One of its strengths is that it is essentially assumption free.  Attempts to formulate analogous theories for unsupervised learning have not been successful.  In unsupervised learning the learner appears to have to make specific assumptions about what similarity means.  If externally provided labels are not available, the learner has to decide which groups of objects are to be categorized as being of one kind, and which of another kind.

I hold the view that supervised learning is a powerful natural phenomenon, while unsupervised learning is not.

So maybe clustering is not a powerful natural phenomenon (but would Rand disagree?), but I’d like to do it anyway.  As some say, clustering is an art rather than a science, but I like art, don’t you? In some sense the question boils down to developing notions of similarity that are appropriate.  Though I must admit I do have some affinity for the notion of “natural kinds” that Bowker and Star sometimes talk about when discussing schemes for classifying various things into categories.  

Let me consider a few examples of clustering to set the stage:

  1. When trying to understand the mapping between neural activity and behavior, it is important to cluster video time series recordings of behavior into a discrete set of “behavioral phenotypes” that can then be understood.  This was done in a paper by Josh Vogelstein et al., summarized here.  An essentially Euclidean notion of similarity was considered.
  2. When trying to understand the nature of the universe and specifically dark matter, a preprint by my old Edgerton-mate Robyn Sanderson et al. discusses the use of the Kullback-Leibler divergence for measuring things in a probabilistic sense, without having to assert the notion of similarity too much in the original domain.
  3. To take a completely different example, how might people in different cultures cluster colors into named categories?  In fact this has been studied in a large-scale worldwide study, which has made the raw data available.  How does frequency become a categorical named color, and which color is most similar to another?
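For the second example, the Kullback-Leibler divergence itself is easy to write down.  The sketch below is only the textbook definition applied to smoothed empirical distributions; the add-one smoothing, the function names, and the toy samples are my assumptions, and none of it is taken from the Sanderson et al. preprint.

```python
import math
from collections import Counter


def empirical_distribution(samples, support):
    """Relative frequencies over a fixed support, with add-one smoothing
    so that no outcome is assigned probability zero."""
    counts = Counter(samples)
    total = len(samples) + len(support)
    return {x: (counts[x] + 1) / total for x in support}


def kl_divergence(p, q):
    """D(p || q) = sum over x of p(x) * log2(p(x) / q(x)), in bits."""
    return sum(p[x] * math.log2(p[x] / q[x]) for x in p if p[x] > 0)


# Two made-up samples over the same two-letter alphabet.
p = empirical_distribution("aabb", "ab")
q = empirical_distribution("abbb", "ab")
d_pq = kl_divergence(p, q)
d_qp = kl_divergence(q, p)
```

Note that `d_pq` and `d_qp` differ: the divergence is not symmetric, so it is not a distance in the metric sense, which matters if it is fed into a clustering method that assumes one.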

Within their domains, these clusterings seem to be effective, but is there a general methodology?  One idea that has been studied is to ask people what they think of the results of various formal clustering algorithms, a form of anthropocentric data analysis, as it were.  Can this be put together algorithmically with information-theoretic ideas on sampling distortion functions due to Niesen et al.?

Another idea, which I learned from Jennifer Dy, whom I met in Mysore at the National Academy of Engineering’s Indo-American Frontiers of Engineering Symposium last month, is to actually create several different possible clusterings and then let people decide.  A very intriguing idea.

Finally, one might consider drawing on universal information theory and go from there.  A central construct in universal channel coding is the maximum mutual information (MMI) decoder, which doesn’t require any statistical knowledge, but learns things as it goes along.  Misra and Weissman modified that basic idea to do clustering rather than decoding, in a really neat result. Didn’t make it into Silicon Valley, as far as I can tell, but really neat.  Applications to dark matter?
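As a reminder of what the MMI decoder does, here is a small sketch of the basic decoding rule: pick the codeword whose empirical mutual information with the received sequence is largest.  The codebook and received sequence are toy values I made up, and this is only the decoding rule, not the clustering procedure of Misra and Weissman.

```python
import math
from collections import Counter


def empirical_mutual_information(xs, ys):
    """Mutual information (in bits) of the joint empirical distribution
    of two equal-length sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


def mmi_decode(codebook, received):
    """Return the index of the codeword with maximum empirical mutual
    information with the received sequence; no channel statistics are used."""
    return max(
        range(len(codebook)),
        key=lambda i: empirical_mutual_information(codebook[i], received),
    )


# Toy example: the channel inverts every bit of codeword 0.
codebook = [
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 1, 0, 1, 0, 1, 0, 1],
]
received = [1, 1, 1, 1, 0, 0, 0, 0]
decoded = mmi_decode(codebook, received)
```

Here the decoder picks codeword 0 even though every single bit was flipped, because it looks only at statistical dependence and never needs to be told what the channel does.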

You are currently en route to Australia to, among other things, present our joint work on olfactory signal processing at the IEEE Statistical Signal Processing workshop.  One paper on active odor cancellation, and the other on food steganography.  Do let me know of any new tips or tricks you pick up down under: hopefully with some labels rather than forcing me to do unsupervised learning.  Also, what would you cluster angelica seed oil with?


Scaling Laws for Waste

April 13, 2014

It has been a long while since I last wrote a blog post.  In the meanwhile, a lot of things have happened, and apropos to the previous post, I have switched from the schedule of a research staff member to the schedule of a starting assistant professor.  Definitely some different elements to it.  More time teaching and less time working on a food truck, to say the least.

Since last I’ve posted, I’ve also gotten much more into tweeting, which perhaps does take away some of my impetus for blogging: limited attention, information overload, and all that.  As one of the great new communication media, I’m also getting interested in Twitter as an object of research study.  Indeed, some of my undergraduate researchers this summer will be looking at social media analytics.

Since the last post, I’ve also gotten further interested in resource recovery and other problems of environmental engineering, though not at all versed in the subject yet. One of the most valuable resources from which to recover energy, nutrients, water, and solids is animal waste.  Indeed, there have even been wars over the control of guano.  

I’ve had some longstanding interest in allometric scaling laws for various things, and I suppose I’ve made you at least somewhat interested.  When I was visiting Santa Fe last summer, I feel like my interest in this topic was renewed, largely due to the enthusiasm of Luis Bettencourt on scaling laws for cities.  As it turns out, there are a lot of parallels to neurobiological scaling.

With all that as preface, do you have any idea how the amount of waste produced by an animal scales with the size of the animal?  Do you think it would be allometric scaling?

In fact this question for urine has been studied in the literature more extensively than I would have expected.  In the paper, “Scaling of Renal Function of Mammals,” Edwards takes data on the mass [kg] and the urine volume [mL/24 hours] for 30 mammalian species and finds an allometric relation with power law exponent 0.75, which is the same power-law exponent as for metabolic rate as given by Kleiber’s Law. (One theoretical derivation based on elasticity is due to McMahon.)
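For what it is worth, an allometric exponent like this is typically estimated as the slope of a straight line fit on log-log axes.  The sketch below shows that fit in plain Python.  The masses, the prefactor of 60, and the exact 0.75 exponent used to generate the data are all synthetic choices of mine, not Edwards's measurements, and I am assuming an ordinary least-squares fit rather than whatever procedure the paper used.

```python
import math


def fit_power_law(masses, values):
    """Least-squares fit of log(value) = log(a) + b * log(mass).
    Returns (a, b) for the allometric relation value = a * mass**b."""
    xs = [math.log(m) for m in masses]
    ys = [math.log(v) for v in values]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = math.exp(y_bar - b * x_bar)
    return a, b


# Synthetic data from an exact 0.75 power law (NOT Edwards's data).
masses_kg = [0.02, 0.3, 4.0, 70.0, 500.0, 4000.0]
urine_ml_per_day = [60.0 * m ** 0.75 for m in masses_kg]
a, b = fit_power_law(masses_kg, urine_ml_per_day)
```

Because the synthetic data are noiseless, the fit simply recovers the exponent and prefactor that went in; with real measurements the scatter around the line is where the interesting arguments start.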

The same urine volume exponent is presented in a paper, “Scaling of osmotic regulation in mammals and birds,” by Calder and Braun. Turning to water loss through feces, Calder and Braun say:

Fecal losses should, in absence of size-related differences in food quality, digestive efficiency, and/or reabsorption, scale in parallel to the intake that supplies metabolic requirements, but the only allometric expression we have found in the literature has M^0.63 scaling [in mL/day].

where the scaling law is quoted from a paper by Blueweiss, Fox, Kudzma, Nakashima, Peters, and Sams, “Relationships between body size and some life history parameters,” from the journal Oecologia.  The original statement in that paper regarding defecation is measured in the units g/g/day and gives a power-law exponent -0.37 based on data from mammals.  This measure of [g/g/day] is already normalized once by body weight, so multiplying back by body weight gives the exponent for total output, 1 – 0.37 = 0.63, assuming constant liquid content, and there is no inconsistency.  The original data used by Blueweiss et al. are said to be from a paper by Sacher and Staffeldt, “Relation of gestation time to brain weight for placental mammals: implications for the theory of vertebrate growth,” though I didn’t see it in there.
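Written out, the unit conversion goes as follows, where the symbols are my own shorthand: M is body mass, F(M) is total daily fecal output, and f(M) is the mass-specific rate in g/g/day.

```latex
f(M) = \frac{F(M)}{M} \propto M^{-0.37}
\quad\Longrightarrow\quad
F(M) = M \, f(M) \propto M^{1 - 0.37} = M^{0.63}
```

Reading this as a statement about water loss in mL/day additionally relies on the constant liquid content assumption mentioned above.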

A contributing factor to all of this is of course food intake, via assimilation efficiency of various foods.  In a paper “Allometry of Food Intake in Free-Ranging Anthropoid Primates” in Folia Primatologica, Barton reports the power-law exponent for daily intake in grams dry weight per 24 hours as probably a little bit more than 0.75 using limited data on 9 species of primates (including humans).  For cattle, Illius in a paper, “Allometry of food intake and grazing behaviour with body size in cattle,” talks about intake with exponent a little bit less than 0.75.

Having talked about urine and feces, what about CO2?  A paper “Direct and indirect metabolic CO2 release by humanity” by Prairie and Duarte quotes allometric laws on respiration and defecation from the book The Ecological Implications of Body Size by Peters, which maybe I should read.

So what does this have to do with information?  I wonder if there is a notion of information metabolism with an associated scaling law like Kleiber’s.  There is a notion of a ‘garbage tape’ in the thermodynamics of computation following Landauer, and so I wonder what fraction of information is put into the garbage tape as a function of the size of the computation.  

Anyway, good to get back into blogging, hopefully without too much garbage.  After all, we don’t want too much information pollution, nor municipal solid waste in cities for that matter.


Scheduling Time

November 3, 2013

One of the interesting things at IBM, at least for me, has been how common it is for people to use Lotus Notes to first see when people might be available, and then to schedule meetings.  Unfortunately there is no way to see why someone else has a slot blocked off; only that it is.  That is why I think an improvement would be to allow circles of visibility, so certain people can see why your calendar is blocked off and whether it seems possible to propose a shift.  Though maybe in hierarchical organizations, this would lead to others requiring membership in circles and imposing value judgments on commitments.

Of course the best for me is the graduate student approach of having very little of one’s time scheduled.  Not just from the scheduling viewpoint itself, but also in the sense of not being too busy to allow thinking time.  As some say, “People with loose, flexible schedules, on the other hand, seem pretty boss”.  A fairly thoughtful self-help (in contrast perhaps to much self-help on time management) article puts forth several good ideas succinctly: slow down, stop trying to be a hero, go home, minimize meetings, go dark, leave the office for lunch, give up on multitasking, and say no.  Doing these things seems to make things nice and slow.

Although some of my work these days is about managing, there is also an element of making things.  That is why I find this article about the difference between the manager’s schedule (which is often partitioned into one-hour blocks) and the maker’s schedule (for people like programmers and writers that generally prefer to use time in units of half a day at least: units of an hour are barely enough time to get started writing) so insightful.  As is discussed:

When you’re operating on the maker’s schedule, meetings are a disaster. A single meeting can blow a whole afternoon, by breaking it into two pieces each too small to do anything hard in. Plus you have to remember to go to the meeting. That’s no problem for someone on the manager’s schedule. There’s always something coming on the next hour; the only question is what. But when someone on the maker’s schedule has a meeting, they have to think about it.

For someone on the maker’s schedule, having a meeting is like throwing an exception. It doesn’t merely cause you to switch from one task to another; it changes the mode in which you work.

I suppose that is why many professors hide away at home or elsewhere when they want to get some serious thinking or writing done:  to avoid handling exceptions.  Can one obtain the managerial benefits of meetings without their negative impacts on making?

To address this, of course one needs to first determine why we have meetings in the first place.  A central role of meetings is for coordination.  But perhaps this role of meetings can be eliminated if there is a possibility of ambient awareness.  This term is something that Clive Thompson uses in his book Smarter Than You Think.  As he says regarding meetings [p. 217]:

But younger workers were completely different.  They found traditional meetings vaguely confrontational and far preferred short, informal gatherings.  Why?  Because they were more accustomed to staying in touch ambiently and sharing information online, accomplishing virtually the tasks that boomers grew up doing physically.  Plus, the younger workers had the intuition—which, frankly, most older workers would agree with—that most meetings are a fantastic waste of time.  When they meet with colleagues or clients, they prefer to do it in a cafe, in clusters small enough—no more than two or three people—that a serious, deep conversation can take place, blended with social interaction, of a sort that is impossible in the classic fifteen-person, all-hands-on-deck conclave.

Besides ongoing coordination, though, another purpose of meetings is to perform planning in the first place.  Again, though, it raises the question of whether planning is really necessary.  For physical work requiring a great deal of equipment and lead time, planning seems required, but what about knowledge work?  In an article about Shannon, Bob Gallager essentially argues against too much planning, saying:

In graduate school, doctoral students write a detailed proposal saying what research they plan to do. They are then expected to spend a year or more carrying out that research. This is a reasonable approach to experimental research, which requires considerable investment in buying and assembling the experimental apparatus. It is a much less reasonable approach to Shannon-style research, since writing sensibly about uncharted problem areas is quite difficult until the area becomes somewhat organized, and at that time the hardest part of the research is finished.

And yet I did write a doctoral thesis proposal and do have ongoing coordination meetings.  It would be interesting though, if instead of a doctoral thesis proposal document, I had written ongoing doctoral thesis tweets.  We had previously discussed microblogging a little bit, but this ambient awareness concept of Thompson may enable making people aware of what is going on, without having to have meetings, and also let someone like me write about uncharted problems in somewhat unorganized ways.

Although we often think of having as many followers as possible as a goal, this may not be the best use of microblogging as a cognitive tool.  As Thompson says [p. 234]:

The lesson is that there’s value in obscurity.  People who lust after huge follower counts are thinking like traditional broadcasters.  But when you’re broadcasting, there’s no to and fro.  You gain reach, but lose intimacy.  Ambient awareness, in contrast, is more about conversation and co-presence—and you can’t be co-present with a zillion people.  Having a million followers might be useful for hawking yourself or your ideas, but it’s not always great for thinking.  Indeed, it may not even be that useful for hawking things.

Perhaps obscurity has some connection to allowing oneself to not do anything too?

Anyway, that was what I wanted to ramble about.  Perhaps I should have scheduled some time with you on Lotus Notes to review things and make sure I had a good plan for this blog post…


On Models and Block Diagrams

September 8, 2013

As you know, I’m really intrigued by creativity these days as are you.  I’ve recently been reading an excellent book about creativity by R. Keith Sawyer entitled, Explaining Creativity: The Science of Human Innovation, and related to the rural and urban that you discuss, he says the following:

Gardner’s research on the “exemplary creator” (1993) found that these exceptional individuals often grew up far from the cultural center, but before they made their most significant contributions, they moved to the center to master the domain and join the field’s networks.

Perhaps there is a cognitive development argument to be made about not being overly stimulated when one is young, but then having to learn what all has been done when one is ready to make contributions.  Anyway, one of the other interesting discussions in the book is about problem-finding as opposed to problem-solving.  As he says:

Some domains are fairly advanced and most of the important problems are well known to everyone.  Knowledge in the domain is well organized and well structured.  If you prefer a problem-finding style of creativity, then you’re likely to be frustrated in such a domain, because it needs problem solvers.  Problem-finding people are better off in domains where the most important issues are unresolved, where conventions and rules are not rigidly specified, where no one even knows where to start.  These tend to be relatively new areas of activity… If you prefer a problem-finding style of creativity, you’ll need to keep a broad watch on the society, looking for the next new thing.

In contrast, if you prefer a problem-solving style, then you’ll probably be happier in a mature domain that’s been around a while…. The questions are well known and the criteria for judging work are objective; everyone will know it when you come up with something new.  Many people prefer the certainty of such domains; in the more ambiguous problem-finding domains, the criteria for creativity are ill defined, and there may be subjective differences of opinion in what counts as good work.

As I like to argue about this kind of thing within an engineering systems theory, the question to me essentially reduces to whether there is a block diagram already or does one have to mathematize to get the block diagram. 


In his Shannon Lecture, David Slepian (at some point I need to learn more about the ghost army) took up how models relate to the real world, saying that it is a miracle that there is any correspondence between the two at all.  

If one thinks about biology rather than engineering systems theory, there is a strong tradition over the past six or seven decades of using model organisms like C. elegans or the mouse.  The idea is to learn as much as one can about a particular organism with the hope that it may provide insight into biological systems more generally.  I am no philosopher, but this model organism oriented method has something to do with nomothetic and idiographic approaches to knowledge.  What interests me is whether this method limits problem-finding and more importantly whether it limits what can be discovered.  There was a really nice sequence of articles a few years ago asking whether the focus on model organisms was impeding research into human disease, starting with why such models emerged in the first place.  I think all biologists should read it.  The author claims that “the mouse monopoly is teetering in the face of cheaper, faster genetic technologies” but I guess this is still to be seen.  I myself have studied C. elegans, but is it time to look at the Antarctic nematode or the naked mole rat?

Even if diversifying the set of organisms studied in biology didn’t advance understanding of human disease, I think it would definitely allow biology researchers to engage in more ‘problem-finding’ kinds of creativity, which I hope can only be a good thing.



July 21, 2013

Yes, I see that you are indeed writing a lot of conference papers recently.  Spreading the gospel from Boston to Bangalore to Beijing, in fairly diverse venues.  Now that you are becoming a star, I know you said you are having limited words to write with, but are you feeling any other deleterious effects?  “In a state of overload, cognitive limitations may constrain the value of a star’s social capital; if the information load goes unmanaged for long periods of time, the star may stumble and, ultimately, fall.”

Broadly, I think the scarcity of human attention is going to be a limiting factor for many aspects of society, especially in building sociotechnical systems, and it also suggests a wide variety of research questions.  In particular I think it gets to the heart of Thomas Malone’s question, “what are the conditions that lead to collective intelligence rather than collective stupidity”?

A few weeks ago, I was visiting the Santa Fe Institute and tried to make the case for human attention, but I’ll let you judge the strength of my argument.  

Incidentally, SFI is a really great place to visit: I had some nice serendipitous encounters.  Likewise with my recent trips to the Center for Nonlinear Studies at Los Alamos National Lab and to the Barabasi Lab at Northeastern.

When you were at the 7th International AAAI Conference on Weblogs and Social Media, you had sent me several papers that essentially advance this argument too.  One that I found particularly intriguing was about serendipity as present in microblogging platforms, which essentially proposes a need to balance surprise and relevance (quality).  Any serendipity for you from either epidemiology or poverty economics?

As you know, this balance between surprise and quality seems to be informing some of my own work recently, whether it is the balance between surprise and flavor in computational creativity for culinary recipes or the balance between surprise and information in communication.

I think the idea that social norms make life much more predictable (rather than surprising) is also an interesting one.  As F. A. Hayek writes in his classic book Individualism and Economic Order:

Quite as important for the functioning of an individualist society as these smaller groupings of men are the traditions and conventions which evolve in a free society and which, without being enforceable, establish flexible but normally observed rules that make the behavior of other people predictable in a high degree. The willingness to submit to such rules, not merely so long as one understands the reason for them but so long as one has no definite reasons to the contrary, is an essential condition for the gradual evolution and improvement of rules of social intercourse; and the readiness ordinarily to submit to the products of a social process which nobody has designed and the reasons for which nobody may understand is also an indispensable condition if it is to be possible to dispense with compulsion. That the existence of common conventions and traditions among a group of people will enable them to work together smoothly and efficiently with much less formal organization and compulsion than a group without such common background, is, of course, a commonplace. But the reverse of this, while less familiar, is probably not less true: that coercion can probably only be kept to a minimum in a society where conventions and tradition have made the behavior of man to a large extent predictable.

I certainly find adherence to social norms makes life easier, but perhaps there is some balance between surprise and quality in social norm formation too? What impact do you think the emergence of global superstars will have on social norms?



July 6, 2013

Señor Jonathan Borlée, it has been a long time since I’ve picked up the metaphorical pen and put something up on the blog.  I think the statement that there are only so many words a person can write in a given time period does apply to me.  However I don’t think it is the microblogging that has been consuming my words; it has been the words in academic conference papers—I’ve never written so many in a short period of time.  Additionally, conscientiously completing two MOOCs (one on epidemiology and the other on poverty economics) required a consistent effort.  I also wonder what effect getting a smartphone has had on me.

“Many of us no longer think clearly,” insists Silicon Valley futurist Alex Soojung-Kim Pang, because of our compulsive attachment to the digital world.

You talked about leading a creative life.  One view on creativity is:

[Image: Calvin and Hobbes comic strip]

Although I don’t necessarily agree with the punchline, I do agree that there is a certain mood required for creativity.  I think that those three points of disconnecting, delving into the past, and being masterful have to be satisfied to get in the mood.  Thanks to your having me over for a couple of days over the holiday, I was able to disconnect and get into the mood to reconnect with the blog.  (It is amazing how watching back-to-back-to-back episodes of King & Maxwell and hour-upon-hour-upon-hour-upon-hour-upon-hour of coverage from the All England Club can allow me to disconnect.)

So what of delving into the past?  I read Friedman’s The World is Flat: A Brief History of the Twenty-First Century this year, eight years after it came out.  I think that now was the right time for me to read it because it is only now that I am beginning to appreciate what globally integrated enterprises, supply chain management including human capital supply chain management, knowledge work, and the services economy mean.  The flat world has enabled Walmart to be the one and only dominant force.  Ames, Hills, Woolworth, Zayre, and their ilk are all gone and Kmart probably will be too eventually.  In entertainment, Matt Allen, Munich, and Mumbai alike were mad for and then mourned for a single superstar: the king of pop.  Just like video killed the radio star, global integration is killing any star but the one superstar.

By making the entire world a single niche, technology is fanning the superstar effect: in sports, labor, and really everywhere.  I don’t doubt that the same will happen with higher education too: MOOC superstars will be the only ones left educating.

People who lose to the superstar effect are certainly the second best and third best performers, but more so the (mostly rural) segment of the population cut off and disconnected from the superstar.

It is a simple and undisputed principle of development theory that rural incomes simply cannot go up much if villages are not meaningfully connected to the city. No society has eve[r] been economically transformed without that link. Not connecting villages and cities in a mutually beneficial manner is a sure way to hurt the village. Trade and transport are two of the best ways known for creating urban-rural links.

So what of being masterful and connecting different bodies of knowledge in new ways?  Maybe Kṛṣṇa’s youth still has some lessons to teach.  His lifting of Govardhana made him a superstar (leaving Indra and the other devas to be the Phil Mickelsons and Vijay Singhs of religion).  He also has an urban-rural duality to him, having been born a prince in the city, raised in rural lands, and returned to the urban world.

The superstar effect cannot be stopped, nor should it be.  I think the key for the future is to have some superstars from all segments.  The most valuable superstar may well turn out to be one who is or was at some point connected enough to delve in, disconnects, and then connects back to disseminate a creation.  Maybe jugaad innovation is a counterbalance to technology destroying jobs.



May 19, 2013

As noted in my previous post, you seem to be microblogging quite a bit these days.  I am strongly considering jumping on the bandwagon, but I’m not quite sure what to tweet.  Any suggestions?  Do you find the 140 character limit to improve your tweeting?  As part of my guiding philosophy I might use “May you enjoy the special pleasures of craft—the private satisfaction of doing a task as well as it can be done. @jeffreylehman #dirt” which has 135 characters, but this would violate “Minimize the use of aphoristic quotations. @jeffreylehman #OptimisticHeart”, which comes in at 74 characters.

I suppose one should consider the structure of what Twitter is.

As noted in Dhiraj Murthy’s book, Twitter, [p. 6]:

This structure of channels and consumers of channels of information draws from notions of broadcasting.  Specifically, Twitter has been designed to facilitate interactive multicasting (i.e., the broadcasting of many to many)… Twitter encourages a many-to-many model through both hashtags and retweets.  A “retweet” (commonly abbreviated as “RT”) allows people to “forward” tweets to their followers and is a key way in which Twitter attempts to facilitate the (re)distribution of tweets outside of one’s immediate, more “bounded” network to broader, more unknown audiences.  It is also one of the central mechanisms by which tweets become noticed by others on Twitter.  Specifically, if a tweet is retweeted often enough or by the right person(s), it gathers momentum that can emulate a snowball effect.

Since he describes it that way, I wonder if there is an information theory problem in there.  Anyway, @dhirajmurthy goes on to say [p. 150]:

Of course, if those promoted tweets are significantly retweeted, that will have more direct effects on Twitter’s modes of originating popular discourse.  The terminology used by marketing professionals is between “organic” and “promoted” trending topics.  The label of “organic” implies more of a grassroots development of a topic, whereas the “promoted” version aims to skip the grassroots building of a topic.  As you can imagine, skipping the construction of a support base can have consequences on the popularity of a topic.  Promoted topics have changed Twitter in that they have brought monetization into tweet audience reception.  However, as the statistics from Twitter show, “organic” topics are the most popular.

So I suppose if one wants to be popular on Twitter, one should take advantage of the interactive multicasting nature of the medium and try to be as organic as possible.  Since you’ve been at it for some time, do you agree?

To be organic, it seems prudent to follow a good number of people on Twitter.  But something like Dunbar’s number must surely come into play.  There is a fairly new paper by Dunbar and several others out that discusses the ability to stay in touch.  The title is “Time as a limited resource: Communication strategy in mobile phone networks,” and the authors are Giovanna Miritello, Esteban Moro, Rubén Lara, Rocío Martínez-López, John Belchamber, Sam G.B. Roberts, and Robin I. M. Dunbar.  The main result is that there are time constraints which limit tie strength (as measured by time spent communicating) in large personal networks, and that even high levels of mobile communication do not alter the disparity of time allocation across networks.  This is argued from the fact that, compared to those with smaller networks, those with large networks do not devote proportionally more time to communication and have on average weaker ties.  Of course time is an inelastic resource, and people only have a limited amount of time in each day to devote to social interaction.

A related paper is titled “Limited communication capacity unveils strategies for human interaction” and is written by Giovanna Miritello, Rubén Lara, Manuel Cebrian, and Esteban Moro, several of whom are also authors of the previous paper.  Again the underlying theoretical framing is around the fact that time, attention, and cognitive resources are inelastic.  Each person is characterized by a communication capacity and by a communication activity level which are different from each other.  The authors then get into how social activity is influenced by these limitations, again studying data from Telefonica in Spain.

The flip side to this is of course having the time, attention, and cognitive resources to tweet.  I saw a recent blog post that describes this as the writer’s shuffle: there are only so many words one can write in a day.  

Given my limited cognitive resources, let’s see if I end up tweeting much or not at all.