Flying, flying, digging, digging

October 27, 2015

As you know, I’ve been travelling quite a bit the last month or so.  I think I may have put on more miles per unit time than ever before.  While flying around, I read a good number of popular books in the broad area of information science that I had been meaning to read.  For example, I read Bursts by László Barabási and learned more about Transylvanian history than I intended.  I also read Social Physics by my one-time collaborator Sandy Pentland, as well as The Life and Work of George Boole: A Prelude to the Digital Age by Desmond MacHale.  I had received this last book as a gift for giving one of the big talks at the When Boole Meets Shannon Workshop at University College Cork in early September.  An extensive biography, it also emphasizes how Boole’s The Laws of Thought makes a strong connection between logic and set theory on the one hand and probability theory on the other, a hundred years before Kolmogorov.  When Boole was reading extracts from the book-in-progress to his wife-to-be Mary Everest, [p. 148]:

She confessed that she felt comforted by the fact that the laws by which the human mind operates were governed by algebraic principles!

Incidentally, in 1868, Mary Boole also wrote a book, The Message of Psychic Science, which had the following rather prescient passage inspired by Babbage’s computer and Jevons’ syllogism evaluator [p. 267]:

Between them they have conclusively proved, by unanswerable logic of facts, that calculation and reasoning, like weaving and ploughing, are work, not for human souls, but for clever combinations of iron and wood.  If you spend time doing work that a machine could do faster than yourselves, it should only be for exercise, as you swing dumb-bells; or for amusement as you dig in your garden; or to soothe your nerves by its mechanicalness, as you take up knitting; not in any hope of so working your way to the truth.

Speaking of iron and wood, one last book I read in my travels is Why Information Grows by Cesar Hidalgo, and a first thing he discusses is how solids are needed to store information.  As he says, close to my heart [p. 34]:

Schrödinger understood that aperiodicity was needed to store information, since a regular crystal would be unable to carry much information.

Let me list my travel venues for you:

And now with that travel done, I think I’ll be going hard on writing and maybe even some theorem-proving and data analytics.  As we’ve discussed, I find that blogging sometimes jump-starts the writing/doing engines, and so here we go with cities.

As promised previously, I perform some formal tests for lognormal distributions of house sizes in Mohenjo Daro and in Syracuse.  As a starting point, I used the lognfit function in MATLAB to find the maximum likelihood estimates of the fit parameters and also the 95% confidence intervals.  The two parameters are the mean μ and standard deviation σ of the associated normal distribution.  The estimated value of σ is the square root of the unbiased estimate of the variance of the log of the data.  Rather than showing the rank-frequency plots as in the previous post, let me show the cumulative distribution functions.  Note that in the Syracuse data, about 1/5 of the houses do not have a listed living area, so I exclude them from this analysis.



At least visually, these don’t look like the best of fits.  To measure the goodness of fit, I use the chi-square goodness-of-fit test as implemented in MATLAB as chi2gof.  With the data ‘area’ already fit using lognfit into the parameter vector ‘parmhat’, this is [h,p] = chi2gof(area,'cdf',@(z)logncdf(z,parmhat(1),parmhat(2)),'nparams',2).  Despite the visual evidence, the chi-square test does not reject the null hypothesis of lognormality at the 5% significance level for Mohenjo Daro.  The chi-square test does reject the null hypothesis of lognormality at the 5% significance level for Syracuse, contrary to the theory of Bettencourt et al.  I wonder what the explanation might be for this contrary finding in Syracuse: maybe some data fidelity issues?
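For readers without MATLAB, the same fit-and-test workflow can be sketched in Python with scipy; the data here are a synthetic stand-in, not the actual Syracuse or Mohenjo Daro records.

```python
# Lognormal MLE fit plus a chi-square goodness-of-fit test, mirroring the
# lognfit/chi2gof workflow described above.  The "area" array is synthetic
# stand-in data, not the real house sizes.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
area = rng.lognormal(mean=4.0, sigma=0.5, size=500)

# MLE of the associated normal's parameters (what lognfit returns as parmhat)
log_area = np.log(area)
mu_hat = log_area.mean()
sigma_hat = log_area.std(ddof=1)  # square root of the unbiased variance estimate

# Chi-square test: bin the data and compare observed to expected counts
# (chi2gof also pools low-count bins; that refinement is skipped here)
counts, edges = np.histogram(area, bins=10)
probs = np.diff(stats.lognorm.cdf(edges, s=sigma_hat, scale=np.exp(mu_hat)))
expected = len(area) * probs
chi2_stat = ((counts - expected) ** 2 / expected).sum()
dof = len(counts) - 1 - 2  # bins minus one, minus two fitted parameters
p_value = stats.chi2.sf(chi2_stat, dof)
print(mu_hat, sigma_hat, p_value)
```

Note that scipy's lognorm uses shape s = σ and scale = exp(μ), matching the parameters of the associated normal distribution.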

By the way, I also promised some other nuggets and so here is one: the relationship between living area and value in Syracuse.


There is certainly more than just living area that determines value.  In fact, the methodology of assessing house value is an interesting one.  One more nugget concerns when the houses that existed in Syracuse in July 2011 were built.


I wonder if there is a way to understand this data through a birth-death process model.  There is a nice theoretical paper in this general direction, “Random Fluctuations in the Age-Distribution of a Population Whose Development is Controlled by the Simple “Birth-and-Death” Process,” by David G. Kendall in the Journal of the Royal Statistical Society in 1950.
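To play with the idea, here is a toy Python simulation of the simple birth-and-death process, with the ages of surviving “houses” at the horizon standing in for construction dates.  The rates, horizon, and seed are made up; this is a sketch, not a fit to the Syracuse data.

```python
# Toy simulation of Kendall's simple birth-and-death process: each existing
# house spawns a new one at rate lam and is demolished at rate mu.  The
# rates, horizon, and seed are made up for illustration.
import random

def birth_death_ages(lam=0.5, mu=0.3, T=20.0, seed=7):
    """Return ages at time T of individuals alive in a linear birth-death process."""
    random.seed(seed)
    births = [0.0]  # birth times of currently living individuals
    t = 0.0
    while births:
        total_rate = len(births) * (lam + mu)
        t += random.expovariate(total_rate)  # time to next event
        if t >= T:
            break
        if random.random() < lam / (lam + mu):
            births.append(t)                           # a birth
        else:
            births.pop(random.randrange(len(births)))  # a death
    return [T - b for b in births]

ages = birth_death_ages()
print(len(ages), sorted(ages)[:3])
```

A histogram of these ages would be the analogue of the year-built plot above.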

To close the story with more birth and death: unlike Mohenjo Daro, on which there is little information, the difficulty for future historians will certainly be too much data.  Before the travels, I finished reading Dataclysm, a popular book by the author of the OkCupid blog; in a sense it is an expanded version of that blog.  One of the big things it points out is that there will be growing longitudinal data about individuals due to social media such as Facebook.  Collections of pennants are eventually taken down from bedroom walls, but nothing is taken down from Facebook walls.  The book also makes use of culturomics.

As I may have foreshadowed, Ron Kline’s book, The Cybernetic Moment (that I helped with a little bit), also uses culturomics a little bit to measure the nature of discourse.

So that is some flying, flying, and digging, digging from me.  Hope you’ll contribute to the discourse so future historians have more to study.  By the way, the city sizes for the various places (as per Wikipedia today) are, from large to small:

  • 13,216,221 – Tokyo
  • 2,722,389 – Chicago
  • 435,413 – Jeju
  • 84,513 – Champaign
  • 67,947 – Santa Fe
  • 41,250 – Urbana
  • 12,019 – Los Alamos
  • 5,138 – Monticello

Perhaps data for a statistical assessment?


Digging into a city

March 26, 2015

Glad to see the work of yours described previously now appearing in a journal paper, and also glad to know that it is doing social good.  One of the main things you did was look for buildings in satellite imagery, which is really quite a neat thing.  As you know, I have been quite intrigued by the science of cities, and perhaps data from satellite imagery can be useful for making empirical statements there.  Can one see municipal waste remotely?  In anticipation of that, perhaps I can dig through some data on cities that I happen to have, and see if there are interesting statements to be made regarding scaling laws within cities (in contrast to most work, which has focused on scaling laws among cities, though I should note the work of Batty et al.).  As examples, I will consider recent data from our hometown of Syracuse, NY and also data from Mohenjo Daro of the Indus Valley civilization.

As you can guess, the Syracuse data was gathered from my service on an IBM Smarter Cities Challenge team, by digging through some old servers held by a not-for-profit partner of the City of Syracuse.  The journal paper on that is finally out, but more importantly it seems to be having some social impact.  Here is a newer video on impacts of what we did there.

The data on Mohenjo Daro is from actual digging, rather than digging through computers.  Built around 2600 BCE, Mohenjo Daro was one of the largest settlements of the ancient Indus Valley Civilization and one of the world’s earliest major urban settlements.  Mohenjo Daro was abandoned in the 19th century BCE, and was not rediscovered until 1922.  The data I will use was initially mapped by British archaeologists in the 1930s in their excavation of Mohenjo Daro, and collected in the paper [Anna Sarcina, “A Statistical Assessment of House Patterns at Moenjo Daro,” Mesopotamia Torino, vol. 13-14, pp. 155-199, 1978.].

Before getting to the data, though, let me describe some theoretical work on the distribution of house sizes from a recent paper of Bettencourt et al. in a new open access journal from the AAAS.  From the settlement scaling theory they develop, they make a prediction about the distribution of house areas.  In particular, the overall distribution should be approximately lognormal.  This prediction is borne out in archaeological data of houses in pre-Hispanic cities in Mexico.  The basic argument for why the lognormal distribution should arise is a multiplicative generative process together with the central limit theorem.  A reference therein attributes the argument back to William Shockley in studying the productivity of scientists, but according to Mitzenmacher it goes back even further.  (Service times in call centers also appear to be approximately lognormal, among other phenomena.)
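The multiplicative argument is easy to illustrate numerically: make each house’s area a product of many small independent growth factors, so that its log is a CLT-style sum.  A quick sketch in Python, with an arbitrary factor distribution:

```python
# Illustration of why multiplicative growth gives lognormality: the log of a
# product of many i.i.d. factors is a sum, which the central limit theorem
# drives toward a normal.  Factor distribution and sizes are arbitrary.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
factors = rng.uniform(0.9, 1.2, size=(5000, 100))  # 100 growth episodes each
areas = 10.0 * factors.prod(axis=1)                # multiplicative growth

log_areas = np.log(areas)
print(log_areas.mean(), stats.skew(log_areas))  # skew near 0: roughly normal
```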

Anyway, coming to our data, let me first show the rank-frequency plot of the surface area (m²) of 183 houses in Mohenjo Daro.

Mohenjo Daro

Now I show the rank-frequency plot of the living area (ft²) of 41,804 houses in Syracuse (data from July 2011).

Syracuse

What do you think?  Does it look approximately lognormal?  I’ll soon write another blog post with some formal statistical analysis, and some other nuggets from these data sets.

Incidentally, as requested, I seem to be making creativity a part of my research agenda (from an information theory and statistical signal processing perspective).  I spoke about fundamental limits to creativity at the ITA Workshop in San Diego in February (though the talk itself ended up being slightly different than the abstract).  I also organized a special session on computational creativity, which was fun.  

I think someone should connect creativity and cities in some precise informational way, and perhaps you are the man to do it.


Remote Sensing

September 7, 2014

Acquiring information from afar is often quite important, whether to reduce the cost of ground investigation, to get a wider view, or perhaps to conceal surveillance activities.  A couple weeks ago at the 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining you had a ‘social good’ paper on using remote sensing data to predict which villages have more poor people than others, based on whether there were more houses with metal roofs or with thatch roofs.  An earlier presentation of this work was given at DataKind, under whose auspices the work was carried out together with the charity GiveDirectly.

Congratulations on the best paper award for this work!

Incidentally I also enjoyed your work on tennis analytics at the same conference and was therefore glad I attended part of the Large-Scale Sports Analytics workshop, in addition to the data-driven educational assessment workshop I was running.  

Coming back to remote sensing and somewhat related to the last post, remote sensing of waste production can potentially be used to sense alien civilizations.  Though more apropos to your work, apparently night-time light remote sensing is becoming a common approach to poverty detection, thinking of night lights as light pollution.  A few papers in a variety of journals on this topic include this, this, and this.  I wonder, though, whether there is a way to measure “signal pollution” as a way to do remote sensing to build on the idea of information metabolism.  With information pollution, maybe it is low-entropy signals one should look for, rather than high-entropy signals.

Perhaps artistic things you can see from the air?




Information Metabolism and Waste Production

August 6, 2014

Your computational aboriginal art is really quite amazing!  I think you can give AARON a run for its money.  In the post previous to your artistic one, you brought up the notion of lateral thinking.  I think part of why it is difficult is that we draw heavily on memory as part of perception.  This is well captured in Leslie Valiant’s Probably Approximately Correct book [p. 142]:

In different parts of the world the same word may have different meanings, or the distribution of examples may be different.  In these cases the Invariance Assumption would be violated, and shared meaning would not be achieved.  Misunderstandings would result.  There are pernicious obstacles to shared meaning even beyond those inherent in differences in meaning and distributions.  These further impediments are imposed by the constraint of a limited mind’s eye interacting with an internal memory full of beliefs.  We may all be looking at the same world through our mind’s eyes, but since we have much control of what information to allow in, dependent on our beliefs, we may not see the same world.  In the mind’s eye we process not only the information coming from outside, but also information internally retrieved from our long-term memory.

As you know, I was in Quebec City last week for the workshops following the Computational Neuroscience Meeting. I spoke about associative memories and how a little bit of circuit noise can improve recall, but also heard an interesting talk by Byron Yu on how it may be difficult to learn things you are not used to.  Details on learning things off-subspace will soon be published in an experimental neuroscience Nature paper.  

All of this talk about informational inputs, outputs, and internals has gotten me thinking about whether one can define a useful notion of information metabolism, in analogy to metabolism, which (as per Wikipedia) is the set of life-sustaining chemical transformations within the cells of living organisms. These enzyme-catalyzed reactions allow organisms to grow and reproduce, maintain their structures, and respond to their environments.  [There is apparently already Kępiński’s notion of information metabolism, but I believe I want something different.]

In an earlier post, I suppose I did discuss metabolic processes of waste, primarily allometric scaling laws for waste production by mammals.  Though as I mentioned, some of the renewed interest in scaling laws comes from the science of cities.  One might wonder, then, about scaling laws for waste as a function of city population rather than animal mass.  Luis Bettencourt has called cities a sort of social reactor that is part star and part network, and so understanding physical flows might be inspirational for understanding informational flows.  Of course sanitation is super essential to public health and welfare in its own right.

As it turns out, there are quite a few people interested in scaling laws for waste in cities, and there seems to be a growing debate in the scientific literature.  Let me summarize discussions on air pollution.  

  • Using data from the Emissions Database for Global Atmospheric Research (EDGAR), Marcotullio et al. find that larger cities have more greenhouse gas emissions (CO2, N2O, CH4, and SF6), in that a small increase in population size in any particular area is associated with a disproportionately larger increase in emissions, on average.
  • Curating a variety of data sources on CO2 emissions, Rybski et al. argue that cities in developing countries are different from cities in developed countries.  In particular, in developing countries large cities emit more CO2 per capita than small cities, with power-law exponent 1.15, whereas in developed countries large cities are more efficient, with power-law exponent 0.80.  (These exponent numbers seem to have numerical significance, as per Bettencourt.)
  • Fragkias et al. also look at CO2 emissions, focusing on cities (metropolitan statistical areas) in the United States, but find a near-proportional increase rather than much gain in efficiency.
  • Oliveira et al.  also consider CO2 emissions in American cities (more concentrated than MSAs), but find a strongly superlinear increase, with a power-law exponent of 1.46.
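For reference, the exponents quoted above come from fits of the form E ≈ c·N^β.  A minimal sketch of how such an exponent is estimated, via least squares in log-log coordinates on synthetic city data (the “true” exponent here is made up):

```python
# Estimating a power-law scaling exponent beta in E ~ c * N^beta by ordinary
# least squares on log-transformed data.  Populations, noise level, and the
# true exponent are synthetic, purely for illustration.
import numpy as np

rng = np.random.default_rng(2)
population = rng.lognormal(mean=11.0, sigma=1.0, size=200)  # synthetic cities
true_beta = 1.15
emissions = 0.5 * population**true_beta * rng.lognormal(0.0, 0.1, size=200)

# log E = log c + beta * log N, so beta is the slope of the log-log fit
beta_hat, log_c_hat = np.polyfit(np.log(population), np.log(emissions), 1)
print(beta_hat)
```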

So there seems to be a great deal of uncertainty on what is happening empirically.  Notwithstanding, theories that link together emissions with traffic congestion have also been proposed.

To add some more fuel to the fire, I thought I might plot out some data too.  Rather than emissions, I looked at air quality measures.  As an example, I took population data and air quality data for some cities in India, and joined them, ignoring data where either was missing.  Note there are often multiple measuring stations within a given city, and I treat them as having their own air quality value, but the same population value.  Here are the results for sulfur dioxide.
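As an aside, the join described above might look like the following in pandas, with hypothetical table contents: station-level air quality merged onto a single population value per city, dropping cities missing from either table.

```python
# Sketch of merging a city-population table with a station-level air-quality
# table.  City names, populations, and SO2 values are hypothetical; an inner
# join drops cities missing from either side, and each station row inherits
# its city's single population value.
import pandas as pd

population = pd.DataFrame({
    "city": ["Delhi", "Mumbai", "Pune"],
    "population": [16_787_941, 12_442_373, 3_124_458],
})
air_quality = pd.DataFrame({
    "city": ["Delhi", "Delhi", "Mumbai", "Agra"],  # multiple stations in Delhi
    "so2": [9.0, 11.0, 5.0, 7.0],
})

merged = population.merge(air_quality, on="city", how="inner")
print(merged)
```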

So maybe nothing too conclusive there.  I wonder if there are other air quality indicators that have some connection to city population.

Sorry for this information snack full of empty calories.


Aboriginal Art

July 10, 2014

Budyari yaguna, Señor William Dawes.  In my Australian adventure, I not only came across media about creativity but also creative media, especially of the aboriginal variety. Before getting to that, however, let me give a shout out to the Bhullar brothers, about whom a viral campaign we never did start, for getting a toehold in the NBA.

Tarisse King

On the first day of the trip, we saw a traditional aboriginal didgeridoo and dance performance at Currumbin in Queensland. Later in the trip, we saw the aboriginal-inspired contemporary dance production Patyegarang at the Sydney Opera House.  Didgeridoo street performers greeted us at Circular Quay before we made our way by ferry to Manly.

Some of the beach-front galleries there were of aboriginal art.  I was especially drawn to the works of Tarisse King including her Earth Images. The dot paintings are mesmerizing in a unique way.  They are meant to represent a view of the earth from above.

Tarisse King: Fire

Upon looking at her paintings, I wondered to myself whether something similar could be created using Gaussian processes and morphological image processing.  Here is my attempt at computer art via the following MATLAB script:

seed = 1234; %random seed
n = 50; %grid size
r = 10; %repetitions per point
s = 8; %resize scale for skeleton
l = 0.05; %squared exponential kernel parameter

rng(seed); %seed the random number generator

%create a grid
[X1,X2] = meshgrid(linspace(0,1,n),linspace(0,1,n));
x = [X1(:),X2(:)];

%covariance calculation using squared exponential kernel
K = exp(-squareform(pdist(x).^2/(2*l^2)));

%matrix square root of the covariance (may have tiny imaginary parts numerically, hence the real() below)
A = sqrtm(K);

%sample from Gaussian process
gaussian_process_sample = A*randn(n^2,1);

%calculate skeleton of the peaks
skeleton = bwmorph(imresize(reshape(real(gaussian_process_sample),n,n),s)>0,'skel',Inf);

%plot the painting on a black background using randomly perturbed copies of points from the Gaussian process sample and overlay the skeleton
figure; hold on;
h = imshow(cat(3,ones(n*s,n*s),ones(n*s,n*s),0.8*ones(n*s,n*s)),'XData',linspace(0,1,n*s),'YData',linspace(0,1,n*s));
for k = 1:r %r jittered dots per grid point, colored by the sampled field
    scatter(x(:,1)+randn(n^2,1)/(4*n),x(:,2)+randn(n^2,1)/(4*n),4,real(gaussian_process_sample),'filled');
end
[sy,sx] = find(skeleton); %overlay the skeleton as small dark dots
plot((sx-1)/(n*s-1),(sy-1)/(n*s-1),'k.','MarkerSize',1);
axis on; axis image; whitebg(gcf,'k'); set(gca,'XTick',[],'YTick',[],'box','on');

seed = 1234
seed = 1235
seed = 1236

What do you think?


Random Episodic Silent Thought

July 9, 2014

G’day mate.  I had a very nice time in Australia and our olfaction stuff was well received at the SSP Workshop.  While I was away, the computational creativity stuff debuted in its Chef Watson manifestation, but that was only one of many creativity-related things I came across during my trip.

On the flight back, I watched The Lego Movie, which in addition to featuring a 1980-Something Space Guy like we used to play with at the Mehrotra residence, is a commentary on the value of creativity.  I hadn’t realized beforehand that the movie’s theme was the supremacy of creatively building things over only following the instructions.  I’m glad I watched it.

I came across articles about creativity in The Atlantic and the New York Times Bits Blog.

Another pleasant viewing experience on the flight was the Australian Broadcasting Corporation’s documentary miniseries Redesign My Brain with Todd Sampson.  It helped me understand how several parts of your research flow together.  The first part of the miniseries utilizes the concept of neuroplasticity to show how Lumosity-like exercises can improve brain function along three dimensions: speed of thought, attention, and memory. I think the first of these can be related to typical Shannon theory, the second to some of your new information theory stuff incorporating Bayesian surprise, and the third to your new associative memory stuff.  The second part of the miniseries is all about human creativity starting with divergent thinking and then moving on to four criteria for creativity: effectiveness, novelty, elegance, and genesis.  The divergent thinking, effectiveness, and novelty are very much part of the computational creativity process we espoused, the Chef Watson app is elegant, and the extension to fashion, business processes, etc. that you talk about is the genesis. 

The last part of the creativity episode is about lateral thinking.  I wonder if and how you can investigate or model lateral thinking using information theory and statistical signal processing, and whether you’d want to include it in your research agenda.



June 28, 2014

A friend of the blog was recently asking both of us how to cluster time series (of possibly different lengths), and in response to that query I had looked at the paper “A novel hierarchical clustering algorithm for gene sequences” by Wei, Jiang, Wei, and Wang, who are bioinformatics researchers from China and Canada.  The basic idea is to generate feature vectors from the raw time series data, define a distance function on the feature space, and then use this distance measure to do (hierarchical) clustering.  At the time, I also flipped through a survey article on clustering time series, “Clustering of time series data—a survey” by Liao, who is an industrial engineer from Louisiana.  As he says, the goal of clustering is to identify structure in an unlabeled data set by organizing it into homogeneous groups where the within-group-object similarity is maximized and the between-group-object dissimilarity is maximized; he points out five broad approaches: partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods.  There are applications in all kinds of fields, such as biology, finance, and of course social media analytics, where one might want to cluster Twitter users according to the time series patterns of tweeting sentiment.
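To make the pipeline concrete, here is a small Python sketch: a hypothetical three-number feature map (mean, standard deviation, lag-1 autocorrelation), Euclidean distance on features, and scipy’s hierarchical clustering.  This is just the shape of the idea, not the algorithm of Wei et al.

```python
# Feature-based hierarchical clustering of time series of different lengths:
# map each series to a fixed-length feature vector, then cluster in feature
# space.  The three features here are one simple, hypothetical choice.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def features(ts):
    ts = np.asarray(ts, dtype=float)
    lag1 = np.corrcoef(ts[:-1], ts[1:])[0, 1]  # lag-1 autocorrelation
    return [ts.mean(), ts.std(), lag1]

series = [
    [0, 1, 0, 1, 0, 1, 0, 1],      # low level, oscillating
    [0, 1, 0, 1, 0, 1],            # same behavior, different length
    [10, 10, 11, 10, 11, 10, 10],  # high level
    [9, 10, 9, 10, 9, 10],         # high level, oscillating
]

X = np.array([features(ts) for ts in series])
Z = linkage(X, method="average")               # hierarchical clustering
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # the first two series cluster together, as do the last two
```

The different lengths cause no trouble because everything downstream sees only the fixed-length feature vectors.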

But any technique seems to require some notion of similarity to proceed.  As Leslie Valiant says in his book, Probably Approximately Correct [p. 159]:

PAC learning as we have described it is a model of supervised learning.  One of its strengths is that it is essentially assumption free.  Attempts to formulate analogous theories for unsupervised learning have not been successful.  In unsupervised learning the learner appears to have to make specific assumptions about what similarity means.  If externally provided labels are not available, the learner has to decide which groups of objects are to be categorized as being of one kind, and which of another kind.

I hold the view that supervised learning is a powerful natural phenomenon, while unsupervised learning is not.

So maybe clustering is not a powerful natural phenomenon (but would Rand disagree?), but I’d like to do it anyway.  As some say, clustering is an art rather than a science, but I like art, don’t you? In some sense the question boils down to developing notions of similarity that are appropriate.  Though I must admit I do have some affinity for the notion of “natural kinds” that Bowker and Star sometimes talk about when discussing schemes for classifying various things into categories.  

Let me consider a few examples of clustering to set the stage:

  1. When trying to understand the mapping between neural activity and behavior, it is important to cluster video time series recordings of behavior into a discrete set of “behavioral phenotypes” that can then be understood.  This was done in a paper by Josh Vogelstein et al., summarized here.  An essentially Euclidean notion of similarity was considered.
  2. When trying to understand the nature of the universe and specifically dark matter, a preprint by my old Edgerton-mate Robyn Sanderson et al. discusses the use of the Kullback-Leibler divergence for measuring things in a probabilistic sense, without having to assert the notion of similarity too much in the original domain.
  3. To take a completely different example, how might people in different cultures cluster colors into named categories?  In fact this has been studied in a large-scale worldwide study, which has made the raw data available.  How does frequency become a categorical named color, and which color is most similar to another?
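Example 2’s probabilistic notion of similarity is easy to sketch: compare objects through their distributions using a symmetrized Kullback-Leibler divergence rather than a Euclidean distance in the raw domain.  The toy distributions below are made up.

```python
# Comparing objects through probability distributions with a symmetrized
# Kullback-Leibler divergence instead of a Euclidean distance in the raw
# domain.  The three toy distributions are made up.
import numpy as np

def kl(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))  # assumes strictly positive q

def sym_kl(p, q):
    return 0.5 * (kl(p, q) + kl(q, p))

a = [0.7, 0.2, 0.1]
b = [0.6, 0.3, 0.1]  # close to a
c = [0.1, 0.2, 0.7]  # far from both

print(sym_kl(a, b), sym_kl(a, c))  # the first is much smaller
```

A matrix of such pairwise divergences can then feed any of the clustering methods above.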

Within their domains, these clusterings seem to be effective, but is there a general methodology?  One idea that has been studied is to ask people what they think of the results of various formal clustering algorithms, a form of anthropocentric data analysis, as it were.  Can this be put together algorithmically with information-theoretic ideas on sampling distortion functions due to Niesen et al.?

Another idea I learned from Jennifer Dy, who I met in Mysore at the National Academy of Engineering‘s Indo-American Frontiers of Engineering Symposium last month, is to actually create several different possible clusterings and then let people decide.  A very intriguing idea.

Finally, one might consider drawing on universal information theory and go from there.  A central construct in universal channel coding is the maximum mutual information (MMI) decoder, which doesn’t require any statistical knowledge, but learns things as it goes along.  Misra and Weissman modified that basic idea to do clustering rather than decoding, in a really neat result (though it didn’t make it into Silicon Valley, as far as I can tell).  Applications to dark matter?
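The core statistic behind an MMI decoder is the empirical mutual information computed from the joint type of two sequences, requiring no knowledge of the true statistics.  Here is that statistic alone in Python (just the statistic, not the Misra-Weissman clustering procedure itself):

```python
# Empirical mutual information (in bits) between two sequences, computed from
# their joint type -- the quantity an MMI decoder maximizes.  This is only
# the statistic, not the full Misra-Weissman clustering procedure.
from collections import Counter
from math import log2

def empirical_mi(xs, ys):
    n = len(xs)
    joint = Counter(zip(xs, ys))        # joint type (empirical distribution)
    px, py = Counter(xs), Counter(ys)   # marginal types
    return sum((c / n) * log2(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())

print(empirical_mi([0, 1, 0, 1], [0, 1, 0, 1]))  # → 1.0 (perfectly dependent)
print(empirical_mi([0, 0, 1, 1], [0, 1, 0, 1]))  # → 0.0 (looks independent)
```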

You are currently en route to Australia to, among other things, present our joint work on olfactory signal processing at the IEEE Statistical Signal Processing workshop.  One paper on active odor cancellation, and the other on food steganography.  Do let me know of any new tips or tricks you pick up down under: hopefully with some labels rather than forcing me to do unsupervised learning.  Also, what would you cluster angelica seed oil with?

