Circle of Life

March 18, 2017

Jambo Señor Bernard Lagat!  Greetings from inside the Maasai Mara where I got to see the whole cast of the Circle of Life.  Before coming for my first trip to Africa, my knowledge of the continent, like that of nearly all Americans, was pretty much completely derived from The Lion King.  As I was psyching myself up for the journey, I played a medley of songs from the film along with “Dry Your Tears, Afrika.”

However, once I got here, it was not Africa that was drying its tears, but me, as I explained to a member of the Clinton Health Access Initiative how the kindness of the American people over the last 70 years has led directly to our family’s and my being in the position we are in.  Whether we take Sam Higginbottom and Mason Vaugh’s Allahabad Agricultural Institute that gave Baba his first job and encouraged his growth, the sponsors that allowed him to come to Illinois for higher studies with Bill Perkins not once but twice, the granters of the tuition waivers that allowed Papa to study there himself, the policymakers whose policies encouraged someone like him to work and gain lawful residence in the U.S., the agency program managers who funded Papa’s research, or the people behind the National Science Foundation Graduate Research Fellowship that allowed both of us to thrive in graduate school, the American people have consistently encouraged achieving the dream through hard work and the rewarding of skill, knowledge, and expertise regardless of caste or creed.

But it seems like we’re going through a “cultural revolution” that some attribute to the hollowing out of the middle class and to rising income inequality due to increased automation of jobs by technological solutions.  I don’t think anyone should be promoting extreme income inequality, but solutions should come from the science and technologies themselves rather than from crippling advances in the science and technologies that have gotten us to this point.  As Stefano Ermon says, “It’s very important that we make sure that [artificial intelligence] is really for everybody’s benefit.”  Last October I submitted a proposal for an artificial intelligence grand challenge for IBM Research to work on, ultimately not selected, on exactly this topic: reducing economic inequality.  Given all that has transpired in the intervening months, my belief that such a project should be undertaken has only strengthened.

Here in Kenya, I had the pleasure of visiting the startups Soko and mSurvey, which are both doing their part in democratizing production and the flow of information.  Both have developed profitable technology-based solutions that happen to push back against inequalities in the developing world.  Back in March 2013, I submitted a proposal, “Production by the Masses,” for inclusion in IBM Research’s longer-term vision for the corporation, also not selected, which has some of the elements that these two companies and others like them epitomize.  However, it also failed to fully anticipate some of the things that have taken hold recently, like the ridiculously high value of data and the power of the blockchain’s distributed ledger, and it over-emphasized the distinction between rural and urban populations.  I now see that the same sort of solutions are needed wherever there are inequalities, which is everywhere.

Yes, there is an ideal inclusive Circle of Life (fragile enough to be Scarred).  Let us all strive for that ideal by valuing knowledge and by using existing and new science and technology.



Flying, flying, digging, digging

October 27, 2015

As you know, I’ve been travelling quite a bit the last month or so.  I think I may have put on more miles per unit time than ever before.  While flying around, I read a good number of popular books that I had been meaning to, in the broad area of information science.  For example, I read Bursts by Albert-László Barabási and learned more about Transylvanian history than I intended.  I also read Social Physics by my one-time collaborator Sandy Pentland, as well as The Life and Work of George Boole: A Prelude to the Digital Age by Desmond MacHale.  I had received this last book as a gift for giving one of the big talks at the When Boole Meets Shannon Workshop at University College Cork in early September.  An extensive biography, it also emphasizes how Boole’s The Laws of Thought makes a strong connection between logic and set theory on the one hand and probability theory on the other, a hundred years before Kolmogorov.  When Boole was reading extracts from the book-in-progress to his wife-to-be Mary Everest, [p. 148]:

She confessed that she felt comforted by the fact that the laws by which the human mind operates were governed by algebraic principles!

Incidentally, in 1868, Mary Boole also wrote a book, The Message of Psychic Science, which had the following rather prescient passage inspired by Babbage’s computer and Jevons’ syllogism evaluator [p. 267]:

Between them they have conclusively proved, by unanswerable logic of facts, that calculation and reasoning, like weaving and ploughing, are work, not for human souls, but for clever combinations of iron and wood.  If you spend time doing work that a machine could do faster than yourselves, it should only be for exercise, as you swing dumb-bells; or for amusement as you dig in your garden; or to soothe your nerves by its mechanicalness, as you take up knitting; not in any hope of so working your way to the truth.

Speaking of iron and wood, one last book I read in my travels is Why Information Grows by Cesar Hidalgo, and a first thing he discusses is how solids are needed to store information.  As he says, close to my heart [p. 34]:

Schrodinger understood that aperiodicity was needed to store information, since a regular crystal would be unable to carry much information.

Let me list my travel venues for you:

And now with that travel done, I think I’ll be going hard on writing and maybe even some theorem-proving and data analytics.  As we’ve discussed, I find blogging to sometimes jump start the writing/doing engines, and so here we go with cities.  

As promised previously, I perform some formal tests for lognormal distributions of house sizes in Mohenjo Daro and in Syracuse.  As a starting point, I used the lognfit function in Matlab to find the maximum likelihood estimates of the fit parameters and also the 95% confidence intervals.  The two parameters are the mean μ and standard deviation σ of the associated normal distribution.  The estimated value of σ is the square root of the unbiased estimate of the variance of the log of the data.  Rather than showing the rank-frequency plots as in the previous post, let me show the cumulative distribution functions.  Note that in the Syracuse data, about 1/5 of houses do not have a listed living area, so I exclude them from this analysis.
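As a rough illustration of the estimates described above, here is a minimal sketch in Python rather than Matlab.  It assumes nothing beyond what the paragraph says: μ is the mean of the log of the data and σ is the square root of the unbiased variance of the log.  The function name fit_lognormal and the convention of marking a missing living area with None are assumptions made for this example, and the confidence intervals that lognfit also reports are left out.

```python
import math
from statistics import mean, stdev

def fit_lognormal(areas):
    """Estimate (mu, sigma) of the normal distribution of log(area).

    Entries that are None or non-positive are skipped, mirroring the
    exclusion of houses without a listed living area.
    """
    logs = [math.log(a) for a in areas if a is not None and a > 0]
    mu_hat = mean(logs)      # mean of the log-data
    sigma_hat = stdev(logs)  # square root of the unbiased (n - 1) variance
    return mu_hat, sigma_hat

# Made-up areas chosen so that the logs are 1, 2, and 3; not real house data.
example_areas = [math.exp(1), math.exp(2), math.exp(3), None]
mu_hat, sigma_hat = fit_lognormal(example_areas)
print(mu_hat, sigma_hat)
```

The same two numbers are what the fitted cumulative distribution function is built from, so they are all that a later goodness-of-fit check needs.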



At least visually, these don’t look like the best of fits.  To measure the goodness of fit, I use the chi-square goodness-of-fit test as implemented in Matlab as chi2gof.  With data ‘area’ already fit using lognfit into parameter vector ‘parmhat’, this is [h,p] = chi2gof(area,'cdf',@(z)logncdf(z,parmhat(1),parmhat(2)),'nparams',2).  Despite the visual evidence, the chi-square test does not reject the null hypothesis of lognormality at the 5% significance level for Mohenjo Daro.  The chi-square test does reject the null hypothesis of lognormality at the 5% significance level for Syracuse, contrary to the theory of Bettencourt, et al.  I wonder what the explanation might be for this contrary finding in Syracuse: maybe some data fidelity issues?
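For readers without Matlab, the idea behind the test can be sketched in plain Python.  This is not a reimplementation of chi2gof: the sketch uses ten equal-probability bins, which is a choice made here and may differ from chi2gof's own binning, and the function names chi2_sf and lognormal_chi2_gof are hypothetical.  What it shares with the call above is the null hypothesis (a lognormal with the fitted parameters) and the loss of two degrees of freedom for the two estimated parameters.

```python
import math
from statistics import NormalDist

def chi2_sf(stat, df):
    """Upper-tail probability of a chi-square variable, computed from the
    series expansion of the regularized lower incomplete gamma function."""
    a, x = df / 2.0, stat / 2.0
    if x <= 0:
        return 1.0
    term = 1.0 / a
    total = term
    k = 0
    while abs(term) > 1e-15 * abs(total):
        k += 1
        term *= x / (a + k)
        total += term
    log_p = a * math.log(x) - x - math.lgamma(a) + math.log(total)
    return max(0.0, 1.0 - math.exp(log_p))

def lognormal_chi2_gof(data, mu, sigma, bins=10, nparams=2):
    """Chi-square test of positive data against lognormal(mu, sigma)."""
    normal = NormalDist(mu, sigma)
    # Interior bin edges on the log scale; each bin has probability 1/bins.
    edges = [normal.inv_cdf(i / bins) for i in range(1, bins)]
    observed = [0] * bins
    for value in data:
        z = math.log(value)
        observed[sum(z > e for e in edges)] += 1
    expected = len(data) / bins
    stat = sum((o - expected) ** 2 / expected for o in observed)
    df = bins - 1 - nparams  # two degrees of freedom lost to the fitted parameters
    return stat, df, chi2_sf(stat, df)

# Made-up data placed exactly at lognormal quantiles, so the fit is perfect.
perfect = [math.exp(NormalDist(0, 1).inv_cdf((i + 0.5) / 100)) for i in range(100)]
stat, df, p = lognormal_chi2_gof(perfect, 0.0, 1.0)
reject = p < 0.05  # reject lognormality at the 5% significance level?
```

A small p-value (below 0.05) corresponds to rejecting lognormality at the 5% significance level, which is the outcome reported for Syracuse.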

By the way, I also promised some other nuggets and so here is one: the relationship between living area and value in Syracuse.


There is certainly more than just living area that determines value.  In fact, the methodology of assessing house value is an interesting one.  One more nugget is on when houses that existed in Syracuse in July 2011 were built.


I wonder if there is a way to understand this data through a birth-death process model.  There is a nice theoretical paper in this general direction, “Random Fluctuations in the Age-Distribution of a Population Whose Development is Controlled by the Simple “Birth-and-Death” Process,” by David G. Kendall from the J. Royal Statistical Society in 1950.

To close the story with more birth and death, unlike studying Mohenjo Daro, on which there is little information, the difficulty for future historians will certainly be too much data.  Before the travels, I finished reading the popular book Dataclysm by the author of the OKCupid blog; in a sense, it is an expanded version of that blog.  One of the big points it makes is that there will be growing longitudinal data about individuals due to social media such as Facebook.  Collections of pennants are eventually taken down from bedroom walls, but nothing is taken down from Facebook walls.  The book also uses culturomics.

As I may have foreshadowed, Ron Kline’s book The Cybernetic Moment (which I helped with a little) also uses some culturomics to measure the nature of discourse.

So that is some flying, flying, and digging, digging from me.  Hope you’ll contribute to the discourse so future historians have more to study.  By the way, the city sizes for the various places (as per Wikipedia today) are, from large to small:

  • 13,216,221 – Tokyo
  • 2,722,389 – Chicago
  • 435,413 – Jeju
  • 84,513 – Champaign
  • 67,947 – Santa Fe
  • 41,250 – Urbana
  • 12,019 – Los Alamos
  • 5,138 – Monticello

Perhaps data for a statistical assessment?


Digging into a city

March 26, 2015

Glad to see your work described previously now appearing in a journal paper, but also glad to know that it is doing social good.  One of the main things you did was look for buildings from satellite imagery, which is really quite a neat thing.  As you know, I have been quite intrigued by the science of cities, and perhaps data from satellite imagery can be useful to make empirical statements there.  Can one see municipal waste remotely?  In anticipation of that, perhaps I can dig through some data on cities that I happen to have, and see if there are interesting statements to be made regarding scaling laws within cities (in contrast to most work that has focused on scaling laws among cities, though I should note the work of Batty, et al.).  As examples, I will consider recent data from our hometown of Syracuse, NY and also data from Mohenjo Daro of the Indus Valley civilization.  

As you can guess, the Syracuse data was gathered from my service on an IBM Smarter Cities Challenge team, by digging through some old servers held by a not-for-profit partner of the City of Syracuse.  The journal paper on that is finally out, but more importantly it seems to be having some social impact.  Here is a newer video on impacts of what we did there.

The data on Mohenjo Daro is from actual digging, rather than digging through computers.  Built around 2600 BCE, Mohenjo Daro was one of the largest settlements of the ancient Indus Valley Civilization and one of the world’s earliest major urban settlements.  Mohenjo Daro was abandoned in the 19th century BCE, and was not rediscovered until 1922.  The data I will use was initially mapped by British archaeologists in the 1930s in their excavation of Mohenjo Daro, and collected in the paper [Anna Sarcina, “A Statistical Assessment of House Patterns at Moenjo Daro,” Mesopotamia Torino, vol. 13-14, pp. 155-199, 1978.].

Before getting to the data, though, let me describe some theoretical work on the distribution of house sizes from a recent paper of Bettencourt, et al. in a new open access journal from AAAS.  From the settlement scaling theory developed, they make a prediction on the distribution of house areas.  In particular, the overall distribution should be approximately lognormal.  This prediction is borne out in archeological data of houses in pre-Hispanic cities in Mexico.  The basic argument for why the lognormal distribution should arise is from a multiplicative generative process and the central limit theorem.  A reference therein attributes the argument back to William Shockley in studying the productivity of scientists, but according to Mitzenmacher it goes back even further.  (Service times in call centers also appear to be approximately lognormal, among other phenomena.)
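The multiplicative argument can be written out in a couple of lines.  This is the generic textbook form of the argument, not necessarily the exact model in the paper; the symbols A_0 (an initial area), ε_i (independent positive growth factors), T (the number of factors), and m and s² (the mean and variance of ln ε_i) are all introduced here for illustration.

```latex
A_T = A_0 \prod_{i=1}^{T} \varepsilon_i
\quad\Longrightarrow\quad
\ln A_T = \ln A_0 + \sum_{i=1}^{T} \ln \varepsilon_i ,
\qquad
\ln A_T \;\approx\; \mathcal{N}\!\left(\ln A_0 + T m,\; T s^2\right)
\ \text{for large } T .
```

Taking logs turns the product into a sum, the central limit theorem makes the sum approximately normal, and so A_T itself is approximately lognormal with μ = ln A_0 + Tm and σ = s√T.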

Anyway, coming to our data, let me first show the rank-frequency plot of the surface area (m²) of 183 houses in Mohenjo Daro.

Mohenjo Daro

Now I show the rank-frequency plot of the living area (ft²) of 41804 houses in Syracuse (data from July 2011).

Syracuse

What do you think?  Does it look approximately lognormal?  I’ll soon write another blog post with some formal statistical analysis, and some other nuggets from these data sets.

Incidentally, as requested, I seem to be making creativity a part of my research agenda (from an information theory and statistical signal processing perspective).  I spoke about fundamental limits to creativity at the ITA Workshop in San Diego in February (though the talk itself ended up being slightly different than the abstract).  I also organized a special session on computational creativity, which was fun.  

I think someone should connect creativity and cities in some precise informational way, and perhaps you are the man to do it.


Remote Sensing

September 7, 2014

Acquiring information from afar is often quite important, whether to reduce the cost of ground investigation, to get a wider view, or perhaps to conceal surveillance activities.  A couple weeks ago at the 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining you had a ‘social good’ paper on using remote sensing data to predict which villages have more poor people than others, based on whether there were more houses with metal roofs or with thatch roofs.  An earlier presentation of this work was given at DataKind, under whose auspices the work was carried out together with the charity GiveDirectly.

Congratulations on the best paper award for this work!

Incidentally I also enjoyed your work on tennis analytics at the same conference and was therefore glad I attended part of the Large-Scale Sports Analytics workshop, in addition to the data-driven educational assessment workshop I was running.  

Coming back to remote sensing and somewhat related to the last post, remote sensing of waste production can potentially be used to sense alien civilizations.  Though more apropos to your work, apparently night-time light remote sensing is becoming a common approach to poverty detection, thinking of night lights as light pollution.  A few papers in a variety of journals on this topic include this, this, and this.  I wonder, though, whether there is a way to measure “signal pollution” as a way to do remote sensing to build on the idea of information metabolism.  With information pollution, maybe it is low-entropy signals one should look for, rather than high-entropy signals.

Perhaps artistic things you can see from the air?




Information Metabolism and Waste Production

August 6, 2014

Your computational aboriginal art is really quite amazing!  I think you can give AARON a run for its money.  In your post previous to your artistic one, you brought up the notion of lateral thinking.  I think part of why it is difficult is because we draw heavily on memory as part of perception.  This is well-captured in Leslie Valiant’s Probably Approximately Correct book [p. 142]:

In different parts of the world the same word may have different meanings, or the distribution of examples may be different.  In these cases the Invariance Assumption would be violated, and shared meaning would not be achieved.  Misunderstandings would result.  There are pernicious obstacles to shared meaning even beyond those inherent in differences in meaning and distributions.  These further impediments are imposed by the constraint of a limited mind’s eye interacting with an internal memory full of beliefs.  We may all be looking at the same world through our mind’s eyes, but since we have much control of what information to allow in, dependent on our beliefs, we may not see the same world.  In the mind’s eye we process not only the information coming from outside, but also information internally retrieved from our long-term memory.

As you know, I was in Quebec City last week for the workshops following the Computational Neuroscience Meeting. I spoke about associative memories and how a little bit of circuit noise can improve recall, but also heard an interesting talk by Byron Yu on how it may be difficult to learn things you are not used to.  Details on learning things off-subspace will soon be published in an experimental neuroscience Nature paper.  

All of this talk about informational inputs, outputs, and internals has gotten me thinking about whether one can define a useful notion of information metabolism, in analogy to metabolism, which (as per Wikipedia) is the set of life-sustaining chemical transformations within the cells of living organisms. These enzyme-catalyzed reactions allow organisms to grow and reproduce, maintain their structures, and respond to their environments.  [There is apparently already Kepinski’s notion of information metabolism, but I believe I want something different.]

In an earlier post, I suppose I did discuss metabolic processes of waste, primarily allometric scaling laws for waste production by mammals.  Though as I mentioned, some of the renewed interest in scaling laws comes from the science of cities.  One might wonder, then, about scaling laws for waste as functions of city population rather than animal mass.  Luis Bettencourt has called cities a sort of social reactor that is part star and part network, and so understanding physical flows might be inspirational for understanding informational flows.  Of course sanitation is super essential to public health and welfare in its own right.

As it turns out, there are quite a few people interested in scaling laws for waste in cities, and there seems to be a growing debate in the scientific literature.  Let me summarize discussions on air pollution.  

  • Using data from the Emissions Database for Global Atmospheric Research (EDGAR), Marcotullio et al. find that larger cities have more greenhouse gas emissions (CO2, N2O, CH4, and SF6), in that a small increase in population size in any particular area is associated with a disproportionately larger increase in emissions, on average.
  • Curating a variety of data sources on CO2 emissions, Rybski et al. argue that cities in developing countries are different from cities in developed countries.  In particular, in developing countries, large cities emit more CO2 per capita than small cities, with power-law exponent 1.15, whereas in developed countries large cities are more efficient, with power-law exponent 0.80.  (These exponent numbers seem to have numerical significance, as per Bettencourt.)
  • Fragkias et al. also look at CO2 emissions, focusing on cities (metropolitan statistical areas) in the United States.  They find little gain in efficiency with size; emissions increase nearly proportionally with population.
  • Oliveira et al. also consider CO2 emissions in American cities (more concentrated than MSAs), but find a strongly superlinear increase, with a power-law exponent of 1.46.

So there seems to be a great deal of uncertainty on what is happening empirically.  Notwithstanding, theories that link together emissions with traffic congestion have also been proposed.
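To make the exponents above concrete, here is a minimal Python sketch of how a scaling exponent can be estimated, assuming the usual power-law form Y = Y0 · N^β for total emissions Y and population N, fit by ordinary least squares on the logs.  The studies summarized above may well use other estimators and city definitions; the function name fit_scaling_exponent and the synthetic numbers are assumptions made for this example, not data from any of the papers.

```python
import math

def fit_scaling_exponent(populations, emissions):
    """Least-squares fit of log(emissions) = log(Y0) + beta * log(population)."""
    xs = [math.log(p) for p in populations]
    ys = [math.log(e) for e in emissions]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    beta = sxy / sxx                       # slope on the log-log plot
    y0 = math.exp(y_bar - beta * x_bar)    # prefactor Y0
    return beta, y0

# Synthetic cities that follow an exact power law (not real measurements).
populations = [1e4, 1e5, 1e6, 1e7]
emissions = [2.0 * n ** 1.15 for n in populations]
beta, y0 = fit_scaling_exponent(populations, emissions)
print(f"beta = {beta:.2f}, Y0 = {y0:.2f}")
```

An exponent β above 1 means emissions grow faster than population (larger cities emit more per capita), β below 1 means larger cities are more efficient, and β near 1 is the near-proportional case.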

To add some more fuel to the fire, I thought I might plot out some data too.  Rather than emissions, I looked at air quality measures.  As an example, I took population data and air quality data for some cities in India, and joined them, ignoring data where either was missing.  Note there are often multiple measuring stations within a given city, and I treat them as having their own air quality value, but the same population value.  Here are the results for sulfur dioxide.
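The join described above can be sketched in a few lines of Python.  The city names, populations, and readings below are placeholders invented for illustration, and the function name join_air_quality is hypothetical; the sketch only shows the two rules stated in the paragraph, namely that rows with missing data are dropped and that every station in a city keeps its own reading but shares the city's population.

```python
def join_air_quality(populations, station_readings):
    """Return one (city, population, reading) row per measuring station.

    A station is skipped if its reading is missing or if its city has no
    population value.
    """
    rows = []
    for city, reading in station_readings:
        population = populations.get(city)
        if population is None or reading is None:
            continue
        rows.append((city, population, reading))
    return rows

# Placeholder values for illustration only, not real data.
populations = {"City A": 1_000_000, "City B": 250_000, "City C": None}
station_readings = [
    ("City A", 12.0),   # two stations in the same city
    ("City A", 15.5),
    ("City B", None),   # missing air quality value
    ("City C", 8.0),    # missing population value
    ("City D", 9.0),    # city absent from the population table
]
rows = join_air_quality(populations, station_readings)
print(rows)
```

Because stations in the same city share a population value, such a city shows up as a vertical stack of points on a population versus air quality plot.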

So maybe nothing too conclusive there.  I wonder if there are other air quality indicators that have some connection to city population.

Sorry for this information snack full of empty calories.


Aboriginal Art

July 10, 2014

Budyari yaguna, Señor William Dawes.  In my Australian adventure, I not only came across media about creativity but also creative media, especially of the aboriginal variety.  Before getting to that, however, let me give a shout out to the Bhullar brothers, about whom we never did start a viral campaign, for getting a toehold in the NBA.

Tarisse King

On the first day of the trip, we saw a traditional aboriginal didgeridoo and dance performance at Currumbin in Queensland.  Later in the trip, we saw the aboriginal-inspired contemporary dance production Patyegarang at the Sydney Opera House.  Didgeridoo street performers greeted us at Circular Quay before we made our way by ferry to Manly.

Some of the beach-front galleries there were of aboriginal art.  I was especially drawn to the works of Tarisse King including her Earth Images. The dot paintings are mesmerizing in a unique way.  They are meant to represent a view of the earth from above.

Tarisse King: Fire

Upon looking at her paintings, I wondered to myself whether something similar could be created using Gaussian processes and morphological image processing.  Here is my attempt at computer art via the following Matlab script:

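seed = 1234; %random seed
n = 50; %grid size
r = 10; %repetitions per point
s = 8; %resize scale for skeleton
l = 0.05; %squared exponential kernel parameter

rng(seed); %seed the random number generator so that the painting is reproducible

%create a grid
[X1,X2] = meshgrid(linspace(0,1,n),linspace(0,1,n));
x = [X1(:),X2(:)];

%covariance calculation using squared exponential kernel
K = exp(-squareform(pdist(x).^2/(2*l^2)));

%matrix square root of the covariance
%(reconstructed step: A was undefined in the script as posted; sqrtm is assumed because it can return small imaginary parts, which would explain the real() calls below)
A = sqrtm(K);

%sample from Gaussian process
gaussian_process_sample = A*randn(n^2,1);

%calculate skeleton of the peaks
skeleton = bwmorph(imresize(reshape(real(gaussian_process_sample),n,n),s)>0,'skel',Inf);

%plot the painting on a black background using randomly perturbed copies of points from the Gaussian process sample and overlay the skeleton
figure; hold on;
%(reconstructed step: the dot plotting was missing from the script as posted; the jitter size and marker size here are guesses)
for k = 1:r
    scatter(x(:,1)+randn(n^2,1)/(2*n),x(:,2)+randn(n^2,1)/(2*n),4,real(gaussian_process_sample),'filled');
end
h = imshow(cat(3,ones(n*s,n*s),ones(n*s,n*s),0.8*ones(n*s,n*s)),'XData',linspace(0,1,n*s),'YData',linspace(0,1,n*s));
set(h,'AlphaData',double(skeleton)); %(reconstructed step: show the pale overlay only on the skeleton pixels)
axis on; axis image; whitebg(gcf,'k'); set(gca,'XTick',[],'YTick',[],'box','on');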

seed = 1234
seed = 1235
seed = 1236

What do you think?


Random Episodic Silent Thought

July 9, 2014

G’day mate.  I had a very nice time in Australia and our olfaction stuff was well received at the SSP Workshop.  While I was away, the computational creativity stuff debuted in its Chef Watson manifestation, but that was only one of many creativity-related things I came across during my trip.

On the flight back, I watched The Lego Movie, which in addition to featuring a 1980-Something Space Guy like we used to play with at the Mehrotra residence, is a commentary on the value of creativity.  I hadn’t realized beforehand that the movie’s theme was the supremacy of creatively building things over only following the instructions.  I’m glad I watched it.

I came across articles about creativity in The Atlantic and the New York Times Bits Blog.

Another pleasant viewing experience on the flight was the Australian Broadcasting Corporation’s documentary miniseries Redesign My Brain with Todd Sampson.  It helped me understand how several parts of your research flow together.  The first part of the miniseries utilizes the concept of neuroplasticity to show how Lumosity-like exercises can improve brain function along three dimensions: speed of thought, attention, and memory. I think the first of these can be related to typical Shannon theory, the second to some of your new information theory stuff incorporating Bayesian surprise, and the third to your new associative memory stuff.  The second part of the miniseries is all about human creativity starting with divergent thinking and then moving on to four criteria for creativity: effectiveness, novelty, elegance, and genesis.  The divergent thinking, effectiveness, and novelty are very much part of the computational creativity process we espoused, the Chef Watson app is elegant, and the extension to fashion, business processes, etc. that you talk about is the genesis. 

The last part of the creativity episode is about lateral thinking.  I wonder if and how you can investigate or model lateral thinking using information theory and statistical signal processing, and whether you’d want to include it in your research agenda.