
The Pandemic Bandwagon

March 21, 2020


A Wuhan-shake to you señor. Hope you’re doing alright with the shelter-in-place order for Santa Clara County and California. We’re on pause here in New York.

Yesterday morning, I was all psyched up to do a blog post and accompanying Twitter thread on 12 Data Science Problems in Mitigating and Managing Pandemics Like COVID-19, walking through several issues related to the crisis to which data science (broadly construed) could contribute. I’ve been glued to Twitter the last few evenings, and a lot of different people have been posting various things. I have things to share, I thought, so why not me?

A wise person asked me to reflect on whether it would be a sensible thing to do. She emphasized that “there are so many people who are jumping on the bandwagon trying to help. Some mean well while some are capitalizing on the situation. And of those that mean well, some are offering silly things.” As you’ve told me on occasion, Shannon was wary of the bandwagon as well, and much preferred the “slow tedious process of hypothesis and experimental validation.” He noted that “a few first rate research papers are preferable to a large number that are poorly conceived or half-finished.” What would he have said to streams of consciousness offered up 280 characters at a time? Adam Rogers wrote yesterday afternoon that “the chatter about a promising drug to fight Covid-19 started, as chatter often does (but science does not), on Twitter.”

I woke up this morning wishing for science, not chatter. I realized that I am not among “men ages 24–36 working in tech” predisposed to “armchair epidemiology.” I turned 37 a whopping five months ago!

Rogers continued: “Silicon Valley lionizes people who rush toward solutions and ignore problems; science is designed to find solutions by identifying those problems.”

So let’s talk about problems, and about how run-of-the-mill data scientists working in isolation, both literally and figuratively, usually lack the requisite problem understanding to make the right contribution.

In dealing with global disease outbreaks, such as the ongoing novel coronavirus pandemic, we can imagine four main opportunities to help: surveillance, testing, management, and cure. We are primarily concerned with zoonotic diseases: diseases that transfer from animals to humans.  By surveilling, we mean tools and techniques for predicting or providing early warnings of outbreaks of novel or known pathogens. By testing, we mean diagnosing individual patients with the disease. By managing, we mean the tools and techniques for better understanding and limiting the spread of the outbreak, providing care, and engaging the citizenry.  By curing, we mean the development of therapeutic agents to administer to infected individuals. In all of these areas, the lone data scientist working without true problem understanding can be misguided at best and detrimental at worst.

Surveilling

  1. Zoonotic pathogen prediction. There are a large number of known pathogens, but for most of them, it is not known whether they can transfer from animals into humans (and develop into outbreaks). It may be possible to predict the likely candidates by training on features of known zoonotic pathogens. We tried doing this a few years ago in partnership with disease ecologist Barbara Han, who defined the relevant features, but didn’t get very far because the features of pathogens are not available in a nice clean tabular dataset; they are locked up inside scientific publications. Automatically extracting knowledge from these highly specialized documents requires a lot of expert ecologist-annotated documents, which is not tenable. Even if we were able to pull together a dataset suitable for predicting zoonoses, we wouldn’t know how to make heads or tails of the results without the disease ecologists.
  2. Informed spillover surveillance. Once a pathogen is known as a zoonotic disease and has had an outbreak, it is important to monitor it for future outbreaks or spillovers. Reservoir species harbor pathogens without having symptoms and without dying, waiting for a vector to carry the disease to humans and start another outbreak. In the first year of the IBM Science for Social Good initiative, we partnered with the same disease ecologist to develop algorithms for predicting the reservoir species of primates for Zika virus in the Americas so that populations of those species could be further scrutinized and monitored. Without Barbara, we would have had no clue about what problem to solve, what data sources to trust, how to overcome severe class imbalance in the prediction task (by combining data from other viruses in the same family), and how the predictions could inform policy.
  3. Outbreak early warning. The earlier we know that an outbreak is starting, the earlier actions can be taken to contain it. There are often small signals in various reports and other data that indicate a disaster is beginning. BlueDot knew something was up with the novel coronavirus as early as December 30, 2019, but they’ve been at this for quite a while and have a team that includes veterinarians, doctors, and epidemiologists. Even then, their warnings were not heeded as strongly as they could have been.

Testing

  1. Group testing. There are shortages of COVID-19 tests in certain places. Well-meaning data scientists ask: isn’t there a smart way to test more people with the same number of tests? (I’ve seen the question asked several times already, including in an email that a friend from grad school sent both of us.) Eventually, someone points out the method of group testing, which has been known since WWII. But even that is not the solution for the current method of testing (PCR). You pointed out in your response to the friend that group testing would require a serological test for COVID-19, which isn’t ready yet. A case of an already known solution being offered for a problem that is not actually the relevant one.
  2. Deep learning from CT images. Deep neural networks have achieved better accuracy than expert physicians in several medical imaging tasks in radiology, dermatology, ophthalmology, and pathology, so it is natural that several groups would try training them to diagnose COVID-19. Again, a well-meaning effort, but sometimes not executed very well. For example, this paper uses CT images of COVID-19-confirmed patients from China as the positive class and images of healthy people from the United States as the negative class, which may introduce spurious correlations and artificially inflate the accuracy. Even if this task is done well, will it find its way into clinical practice? That has not yet happened in the tasks mentioned above, despite the initial demonstrations having occurred several years ago.
  3. Classifying breathing patterns. A paper posted to arXiv with the title Abnormal respiratory patterns classifier may contribute to large-scale screening of people infected with COVID-19 in an accurate and unobtrusive manner claims that “According to the latest clinical research, the respiratory pattern of COVID-19 is different from the respiratory patterns of flu and the common cold. One significant symptom that occurs in the COVID-19 is Tachypnea. People infected with COVID-19 have more rapid respiration,” but the authors provide no reference to this clinical research and I haven’t been able to track it down myself. If there isn’t really any distinguishing difference between the respiration patterns of flu and COVID-19, then this work is in vain, and that could have been avoided by conferring with clinicians.
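To make the group-testing point above concrete, here is a minimal Python sketch of Dorfman pooling (the 2% prevalence figure is an invented illustration, not a COVID-19 estimate): pool k samples per test, and retest each member of a positive pool individually.

```python
# Dorfman group testing: pool k samples per test; if a pool is positive,
# retest each member individually. With prevalence p, the expected number
# of tests per person is 1/k + 1 - (1 - p)**k.
def expected_tests_per_person(p: float, k: int) -> float:
    return 1.0 / k + 1.0 - (1.0 - p) ** k

# at an illustrative 2% prevalence, the best pool size is 8, needing only
# about 0.27 tests per person instead of 1
best_k = min(range(2, 50), key=lambda k: expected_tests_per_person(0.02, k))
print(best_k, round(expected_tests_per_person(0.02, best_k), 3))
```

Of course, even the optimal pool size only helps under the assumption that pooled samples can be tested reliably, which is exactly the caveat raised above about the current testing method.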

Managing

  1. Spatiotemporal epidemiological modeling. Once an outbreak has started, it is important to model its spread to inform decision making in the response. This is the purview of epidemiology and has a lot of nuance to it. Small differences in the input can yield large differences in the output. This should be left to the experts who have been doing it for many years.
  2. Data-driven decision making. Another aspect to managing an outbreak is collecting primary (e.g. case counts), secondary (e.g. hospital beds, personal protective equipment), and tertiary (e.g. transportation and other infrastructure) information. This is highly challenging and in a disaster situation requires both formal and informal means. During the 2014 Ebola outbreak, we observed that there was a lot of enthusiasm for collecting, collating, and visualizing the case counts, but not so much for the secondary and tertiary information, which, according to the true experts, is really the most important for managing the situation. The same focus on the former is true now, but at least there is some focus on the latter. Enthusiasm is great, but better when directed to the important problems.
  3. Engaging the public. In managing outbreaks, it is critical to inform the public of best practices to limit the person-to-person spread of the disease (which may go against cultural norms) and also to receive information from the situation on the ground. This has been done to good effect in the past, such as during the Ebola outbreak, and in certain places now, but seems to be lacking in many other places. Misinformation and disinformation on peer-to-peer and social network platforms appear to be rampant, but there seems to be little ‘tech solutioning’ in this space so far; perhaps the energy is being spent elsewhere.
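The sensitivity point in item 1 is easy to see even in a toy SIR model (all parameter values here are illustrative, not calibrated to COVID-19): a 25% change in the transmission rate roughly doubles the final epidemic size.

```python
# toy SIR model integrated with Euler steps; beta is the transmission rate
# and gamma the recovery rate (values illustrative, not calibrated)
def final_size(beta: float, gamma: float = 0.1,
               days: int = 2000, dt: float = 0.1) -> float:
    s, i, r = 0.999, 0.001, 0.0
    for _ in range(int(days / dt)):
        new_inf = beta * s * i * dt
        new_rec = gamma * i * dt
        s, i, r = s - new_inf, i + new_inf - new_rec, r + new_rec
    return r  # fraction of the population ever infected

# a 25% bump in beta (R0 from 1.2 to 1.5) roughly doubles the final size
print(round(final_size(0.12), 3), round(final_size(0.15), 3))
```

If a toy model this small is already so sensitive, the real spatiotemporal models, with far more inputs, are all the more reason to defer to the epidemiologists.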

Curing

  1. Drug repurposing. Interestingly, drugs developed for one disease sometimes also have a therapeutic effect on other diseases. For example, chloroquine, an old malaria drug, has an effect on certain cancers and anecdotally seems to show an effect on the novel coronavirus. By finding such old generic drugs, whose safety has already been tested and which might be inexpensive and already in large supply, we can quickly start tamping down an outbreak once the therapeutic effect is confirmed in a large-scale clinical trial. But such repurposing opportunities are difficult to notice at large scale without the use of natural language processing on scientific publications. A consortium recently released a collection of 29,000 scientific publications related to COVID-19 (CORD-19), but there is very little guidance for NLP researchers on what to do with that data and no subject matter expert support. Therefore, it seems unlikely that anything of much use will come out of it.
  2. Novel drug generation and discovery. Repurposing has its limits; we must also discover completely new drugs for new diseases. State-of-the-art generative modeling approaches have begun that journey, but are currently difficult to control. Moreover, consulting subject matter experts is required to figure out what desirable properties to control for in the generation: things like toxicity and solubility. Finally, generating sequences of candidate drugs in silico only makes sense if there is close coupling with laboratories that can actually synthesize and test the candidates.
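As a hint of what NLP over a corpus like CORD-19 might look like, here is a deliberately naive Python sketch (the three “abstracts” and the drug list are invented, not real CORD-19 data): it counts co-mentions of known drugs with coronavirus terms, roughly the crudest possible signal for surfacing repurposing candidates.

```python
import re
from collections import Counter

# invented mini-corpus standing in for CORD-19 abstracts
abstracts = [
    "Chloroquine showed in vitro activity against SARS-CoV-2.",
    "Remdesivir was evaluated in patients with severe COVID-19.",
    "Chloroquine is an established antimalarial drug.",
]
drugs = {"chloroquine", "remdesivir"}  # invented seed list of known drugs

# crude signal: count mentions of known drugs in abstracts that also
# mention a coronavirus term
covid_terms = re.compile(r"sars-cov-2|covid-19", re.IGNORECASE)
co_mentions = Counter()
for text in abstracts:
    if covid_terms.search(text):
        for word in re.findall(r"[a-z0-9-]+", text.lower()):
            if word in drugs:
                co_mentions[word] += 1

print(co_mentions.most_common())
```

Anything beyond this toy, e.g. distinguishing “showed activity against” from “had no effect on,” is exactly where the missing subject matter expert support starts to hurt.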

In my originally envisioned post, I was going to end with a sort of cute twelfth item: staying at home.  According to data presented by the New York Times, data scientists are, apart from lumberjacks, among the professions most suited to not spreading the coronavirus. But in fact, this is not merely a cute conclusion: it is the one contribution that data scientists can truly make well while in isolation, off the bandwagon. When the fog clears, however, let’s be deliberate and work interdisciplinarily to create full, well-thought-out, and tested solutions for mitigating and managing global pandemics.


NAACL Stats

June 3, 2018

Comment ça se plume? The venerable Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL) reconvenes this week. The Great AI War of 2018 revisits New Orleans for another skirmish. 

Following my previous posts on AISTATS paper counts, ICASSP paper counts, ICLR paper counts, and SDM paper counts, below are the numbers for accepted NAACL papers among companies for long papers, short papers, industry papers, and all combined.

Company Paper Count (Long)
Microsoft 10
Amazon 5
Facebook 5
IBM 4
Tencent 4
DeepMind 3
Google 3
JD 3
NTT 3
Adobe 2
Elemental Cognition 2
PolyAI 2
Siemens 2
Agolo 1
Aylien 1
Bloomberg 1
Bytedance 1
Choosito 1
Data Cowboys 1
Educational Testing Service 1
Fuji Xerox 1
Grammarly 1
Huawei 1
Improva 1
Interactions 1
Intuit 1
Philips 1
Samsung 1
Snap 1
Synyi 1
Thomson Reuters 1
Tricorn (Beijing) Technology 1
Company Paper Count (Short)
IBM 4
Google 3
Microsoft 3
Facebook 2
Adobe 1
Alibaba 1
Amazon 1
Ant Financial Services 1
Bloomberg 1
Educational Testing Service 1
Infosys 1
NTT 1
PolyAI 1
Preferred Networks 1
Roam Analytics 1
Robert Bosch 1
Samsung 1
SDL 1
Tencent 1
Thomson Reuters 1
Volkswagen 1
Company Paper Count (Industry)
Amazon 6
eBay 4
IBM 2
Airbnb 1
Boeing 1
Clinc 1
Educational Testing Service 1
EMR.AI 1
Google 1
Interactions 1
Microsoft 1
Nuance 1
SDL 1
XING 1
ZEIT online 1
Company Paper Count (Total)
Microsoft 14
Amazon 12
IBM 10
Facebook 7
Google 7
Tencent 5
eBay 4
NTT 4
Adobe 3
DeepMind 3
Educational Testing Service 3
JD 3
PolyAI 3
Bloomberg 2
Elemental Cognition 2
Interactions 2
Samsung 2
SDL 2
Siemens 2
Thomson Reuters 2

My methodology was to click on all the pdfs in the proceedings and manually note affiliations.


SDM Stats

May 3, 2018

Hello! The venerable SIAM International Conference on Data Mining (SDM) reconvenes today for its eighteenth edition. The Great AI War of 2018 heads down the Pacific coast. 

Following my previous posts on AISTATS paper counts, ICASSP paper counts, and ICLR paper counts, below are the numbers for accepted SDM papers among companies.

Company Paper Count
IBM 5
Baidu 2
Samsung 2
Adobe 1
Facebook 1
Google 1
LinkedIn 1
NEC 1
NTUC Link 1
PPLive 1
Raytheon 1

My methodology is a manual scan of the printed program.


ICLR Stats

April 27, 2018

Hello bonjour! The venerable International Conference on Learning Representations (ICLR) reconvenes Monday for its sixth edition. The Great AI War of 2018 heads a little west. 

Following my previous posts on AISTATS paper counts broken down by institution and ICASSP paper counts broken down by company, below are the numbers for accepted ICLR main conference papers among the top companies.  Like in my ICASSP stats, Google and DeepMind are not treated separately.

Company Paper Count
Google 68
Microsoft 19
Facebook 14
IBM 8
Salesforce 6
Baidu 4
NVIDIA 4
Intel 3

My methodology this time relied on the data compiled by pajoarthur including his logic for converting email addresses to institutions.  However, I aggregated the numbers differently.  He considered ‘Invite to Workshop Track’ status papers in his counts, whereas I did not.  He evaluated the contribution of an author by dividing by the total number of authors of a paper, and then summing up these partial contributions by company; like I did for AISTATS and ICASSP, I counted a paper for a company if it had at least one author from that company.
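The difference between the two aggregation schemes can be sketched in a few lines of Python (the paper list here is a made-up toy, not ICLR data):

```python
from collections import Counter

# toy paper list: each entry is the author affiliation list of one paper
papers = [
    ["Google", "Google", "MIT"],
    ["Google", "DeepMind"],
    ["IBM"],
]

# scheme used in these posts: a paper counts once for a company if at
# least one author is from that company
at_least_one = Counter()
for affils in papers:
    for company in set(affils):
        at_least_one[company] += 1

# fractional scheme: each author contributes 1/(number of authors)
# of a paper to their company
fractional = Counter()
for affils in papers:
    for company in affils:
        fractional[company] += 1 / len(affils)

print(at_least_one["Google"], round(fractional["Google"], 2))
```

In this toy, Google gets 2 papers under the at-least-one-author scheme but only 2/3 + 1/2 of a paper under the fractional scheme, which is why the two methodologies can rank companies differently.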


ICASSP Stats

April 15, 2018

Annyeong! The venerable IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) reconvenes today for its forty-third edition. The Great AI War of 2018 rolls on. 

If you ask how signal processing has become AI, read my recent essay on the topic.

Following my previous post on AISTATS paper counts broken down by institution, I present the numbers for ICASSP below, but only for companies.  This time around, Google and DeepMind are not treated separately.

Company Paper Count
Google 38
NTT 26
Microsoft 25
IBM 16
Huawei 11
Amazon 9
Mitsubishi 9
Samsung 7
Facebook 6
Tencent 6
Alibaba 5
Apple 5
Ericsson 4
Robert Bosch 4
Starkey Hearing Technologies 4
Tata Consultancy Services 4
SRI International 3
Technicolor 3
Toshiba 3
Adobe 2
Analog Devices 2
GE 2
GN 2
Halliburton 2
Hitachi 2
Intel 2
NEC 2
Orange Labs 2
Origin Wireless 2
Qualcomm 2
Raytheon 2
Sony 2
Spotify 2
Thales 2
Toyota 2

I used the official paper index and did not dig deeper into the papers in any way.


AISTATS Stats

April 9, 2018

Hola! The venerable International Conference on Artificial Intelligence and Statistics (AISTATS) reconvenes today for its twenty-first edition. It has recently become common for there to be blog posts presenting the counts of papers at machine learning conferences broken down by institution.  I had not seen it for AISTATS 2018, so I went ahead and put the numbers together.

Institution Paper Count
MIT 13
UC Berkeley 12
Carnegie Mellon 11
Google 11
Stanford 11
IBM 9
Oxford 9
Princeton 9
INRIA 8
Texas 8
Duke 7
EPFL 7
Cornell 6
DeepMind 6
Harvard 6
Microsoft 6
Tokyo 6
ETH Zurich 5
Georgia Tech 5
Michigan 5
Purdue 5
RIKEN 5

Since The Great AI War of 2018 is apparently ongoing, here are the numbers for companies.

Institution Paper Count
Google 11
IBM 9
DeepMind 6
Microsoft 6
Adobe 3
Amazon 2
NTT 2
Baidu 1
Charles River Analytics 1
D. E. Shaw 1
Disney 1
Face++ 1
Facebook 1
Mind Foundry 1
NAVER LABS 1
NEC 1
Netflix 1
Prowler.io 1
SigOpt 1
Snap 1
Tencent 1
Vicarious 1
Volkswagen 1

In case one is tempted to add the Google and DeepMind numbers together, note that there is one paper in common, so the total is 16, not 17 for the two in combination.

Affiliation to institution is not an exact science, and is exacerbated by the official accepted papers list here not containing the affiliations of many authors (and in some cases not even being the final list of authors for papers), there being many ways to refer to the same institution, and there being ambiguity of what is a single institution and what is multiple (this is especially difficult for me among French institutions).  I have done my best to find preprints and look at personal websites to fill in and correct institutions given in the accepted papers list. Here is the raw data file that I put together.


Circle of Life

March 18, 2017

Jambo Señor Bernard Lagat!  Greetings from inside the Maasai Mara, where I got to see the whole cast of the Circle of Life.  Before coming for my first trip to Africa, my knowledge of the continent, like that of nearly all Americans, was pretty much completely derived from The Lion King.  As I was psyching myself up for the journey, I played a medley of songs from the film along with “Dry Your Tears, Afrika.”

However, once I got here, it was not Africa that was drying its tears, but me, as I explained to a member of the Clinton Health Access Initiative how the kindness of the American people over the last 70 years has led directly to our family’s and my being in the position we are in.  Whether we take Sam Higginbottom and Mason Vaugh’s Allahabad Agricultural Institute that gave Baba his first job and encouraged his growth, the sponsors that allowed him to come to Illinois for higher studies with Bill Perkins not once but twice, the granters of the tuition waivers that allowed Papa to study there himself, the policymakers whose policies encouraged someone like him to work and gain lawful residence in the U.S., the agency program managers who funded Papa’s research, or the people behind the National Science Foundation Graduate Research Fellowship that allowed both of us to thrive in graduate school, the American people have consistently encouraged achieving the dream through hard work and the rewarding of skill, knowledge, and expertise regardless of caste or creed.

But it seems like we’re going through a “cultural revolution” that some attribute to the hollowing out of the middle class and rising income inequality due to increased automation of jobs by technological solutions.  I don’t think anyone should be promoting extreme income inequality, but solutions should come from science and technology themselves rather than from crippling the advances in science and technology that have gotten us to this point.  As Stefano Ermon says, “It’s very important that we make sure that [artificial intelligence] is really for everybody’s benefit.”  Last October I submitted a proposal for an artificial intelligence grand challenge for IBM Research to work on, ultimately not selected, on exactly this topic: reducing economic inequality.  Given all that has transpired in the intervening months, my belief that such a project should be undertaken has only strengthened.

Here in Kenya, I had the pleasure of visiting the startups Soko and mSurvey who are both doing their part in democratizing production and the flow of information.  Both have developed profitable technology-based solutions that happen to push back against inequalities in the developing world.  Back in March 2013, I had submitted a proposal “Production by the Masses” for inclusion in IBM Research’s longer-term vision for the corporation, also not selected, which has some of the elements that these two companies and others like them epitomize.  However, it also failed to fully anticipate some of the things that have taken hold recently like the ridiculously high value of data and the power of the blockchain’s distributed ledger, and over-emphasized the distinction between rural and urban populations.  I now see that the same sort of stuff is needed everywhere there are inequalities, which is everywhere.

Yes there is an ideal inclusive Circle of Life (fragile enough to be Scarred).  Let us all strive for that ideal by valuing knowledge and by using existing and new science and technology.



Flying, flying, digging, digging

October 27, 2015

As you know, I’ve been travelling quite a bit the last month or so.  I think I may have put on more miles per unit time than ever before.  While flying around, I read a good number of popular books that I had been meaning to, in the broad area of information science.  For example, I read Bursts by László Barabási and learned more about Transylvanian history than I intended.  I also read Social Physics by my one-time collaborator Sandy Pentland, as well as The Life and Work of George Boole: A Prelude to the Digital Age by Desmond MacHale.  I had received this last book as a gift for giving one of the big talks at the When Boole Meets Shannon Workshop at University College Cork in early September.  An extensive biography, it also emphasizes how Boole’s The Laws of Thought makes a strong connection between logic and set theory on the one hand and probability theory on the other, a hundred years before Kolmogorov.  When Boole was reading extracts from the book-in-progress to his wife-to-be Mary Everest, [p. 148]:

She confessed that she felt comforted by the fact that the laws by which the human mind operates were governed by algebraic principles!

Incidentally, in 1868, Mary Boole also wrote a book, The Message of Psychic Science, which had the following rather prescient passage inspired by Babbage’s computer and Jevons’ syllogism evaluator [p. 267]:

Between them they have conclusively proved, by unanswerable logic of facts, that calculation and reasoning, like weaving and ploughing, are work, not for human souls, but for clever combinations of iron and wood.  If you spend time doing work that a machine could do faster than yourselves, it should only be for exercise, as you swing dumb-bells; or for amusement as you dig in your garden; or to soothe your nerves by its mechanicalness, as you take up knitting; not in any hope of so working your way to the truth.

Speaking of iron and wood, one last book I read in my travels is Why Information Grows by Cesar Hidalgo, and one of the first things he discusses is how solids are needed to store information.  As he says, close to my heart [p. 34]:

Schrodinger understood that aperiodicity was needed to store information, since a regular crystal would be unable to carry much information.

Let me list my travel venues for you:

And now with that travel done, I think I’ll be going hard on writing and maybe even some theorem-proving and data analytics.  As we’ve discussed, I find blogging to sometimes jump start the writing/doing engines, and so here we go with cities.  

As promised previously, I perform some formal tests for lognormal distributions of house sizes in Mohenjo Daro and in Syracuse. As a starting point, I used the lognfit function in MATLAB to find the maximum likelihood estimates of the fit parameters and also the 95% confidence intervals.  The two parameters are the mean μ and standard deviation σ of the associated normal distribution.  The estimated value of σ is the square root of the unbiased estimate of the variance of the log of the data.  Rather than showing the rank-frequency plots as in the previous post, let me show the cumulative distribution functions.  Note that in the Syracuse data, about 1/5 of houses do not have a listed living area, so I exclude them from this analysis.

[Figure: cumulative distribution function of house surface areas, Mohenjo Daro]

[Figure: cumulative distribution function of living areas, Syracuse]

At least visually, these don’t look like the best of fits.  To measure the goodness of fit, I use the chi-square goodness-of-fit test as implemented in MATLAB as chi2gof.  With the data ‘area’ already fit using lognfit into the parameter vector ‘parmhat’, this is [h,p] = chi2gof(area,'cdf',@(z)logncdf(z,parmhat(1),parmhat(2)),'nparams',2). Despite the visual evidence, the chi-square test does not reject the null hypothesis of lognormality at the 5% significance level for Mohenjo Daro.  The chi-square test does reject the null hypothesis of lognormality at the 5% significance level for Syracuse, contrary to the theory of Bettencourt, et al.  I wonder what the explanation might be for this contrary finding in Syracuse: maybe some data fidelity issues?
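For readers without MATLAB, a rough Python equivalent of this workflow might look like the following sketch (the data are synthetic draws from a lognormal, and the equal-width binning is cruder than chi2gof’s, which pools sparse bins):

```python
import math
import numpy as np

rng = np.random.default_rng(0)
area = rng.lognormal(mean=5.0, sigma=0.5, size=1000)  # synthetic "house areas"

# MLE of the underlying normal parameters, as MATLAB's lognfit returns
log_area = np.log(area)
mu_hat = log_area.mean()
sigma_hat = log_area.std(ddof=1)  # square root of the unbiased variance

def logn_cdf(x, mu, sigma):
    # lognormal CDF via the normal CDF: Phi((ln x - mu) / sigma)
    return 0.5 * (1.0 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2.0))))

# binned chi-square statistic against the fitted lognormal
counts, edges = np.histogram(area, bins=10)
cdf_vals = np.array([logn_cdf(e, mu_hat, sigma_hat) for e in edges])
expected = counts.sum() * np.diff(cdf_vals) / (cdf_vals[-1] - cdf_vals[0])
chi2_stat = ((counts - expected) ** 2 / expected).sum()

# with 10 bins and 2 fitted parameters, df = 7; the 5% critical value is ~14.07
print(round(chi2_stat, 2), chi2_stat > 14.07)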

By the way, I also promised some other nuggets and so here is one: the relationship between living area and value in Syracuse.

[Figure: assessed value vs. living area, Syracuse]

There is certainly more than just living area that determines value.  In fact, the methodology of assessing house value is an interesting one.  One more nugget is on when houses that existed in Syracuse in July 2011 were built.

[Figure: histogram of year built for houses existing in Syracuse in July 2011]

I wonder if there is a way to understand this data through a birth-death process model.  There is a nice theoretical paper in this general direction, “Random Fluctuations in the Age-Distribution of a Population Whose Development is Controlled by the Simple “Birth-and-Death” Process,” by David G. Kendall from the Journal of the Royal Statistical Society in 1950.
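A crude simulation along those lines (the birth and demolition rates are invented) shows how a simple birth-death process shapes the age distribution of a housing stock:

```python
import random

random.seed(1)
# toy housing stock: 100 new houses are "born" each year; each existing
# house "dies" (is demolished) with probability 1% per year (rates invented)
birth_per_year = 100
death_prob = 0.01
ages = []
for year in range(200):
    ages = [a + 1 for a in ages if random.random() > death_prob]
    ages.extend([0] * birth_per_year)

# the surviving stock has a roughly geometric age profile: each extra year
# of age thins a cohort by another factor of 0.99
young = sum(1 for a in ages if a < 50)
old = sum(1 for a in ages if a >= 50)
print(len(ages), young, old)
```

Matching real year-built histograms would of course need time-varying birth rates (construction booms) and age-dependent demolition, but the stationary sketch gives the flavor.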

To close the story with more birth and death: unlike studying Mohenjo Daro, on which there is little information, the difficulty for future historians will certainly be too much data. Before the travels, I finished reading through the popular book Dataclysm by the author of the OKCupid blog; in a sense it is an expanded version of that blog. One of the big things it points out is that there will be growing longitudinal data about individuals due to social media such as Facebook.  Collections of pennants are eventually taken down from bedroom walls, but nothing is taken down from Facebook walls. The book also makes use of culturomics.

As I may have foreshadowed, Ron Kline’s book, The Cybernetic Moment (that I helped with a little bit), also uses culturomics a little bit to measure the nature of discourse.

So that is some flying, flying, and digging, digging from me.  Hope you’ll contribute to the discourse so future historians have more to study.  By the way, the city sizes for the various places (as per Wikipedia today) are, from large to small:

  • 13,216,221 – Tokyo
  • 2,722,389 – Chicago
  • 435,413 – Jeju
  • 84,513 – Champaign
  • 67,947 – Santa Fe
  • 41,250 – Urbana
  • 12,019 – Los Alamos
  • 5,138 – Monticello

Perhaps data for a statistical assessment?


Digging into a city

March 26, 2015

Glad to see your work described previously now appearing in a journal paper, but also glad to know that it is doing social good.  One of the main things you did was look for buildings from satellite imagery, which is really quite a neat thing.  As you know, I have been quite intrigued by the science of cities, and perhaps data from satellite imagery can be useful to make empirical statements there.  Can one see municipal waste remotely?  In anticipation of that, perhaps I can dig through some data on cities that I happen to have, and see if there are interesting statements to be made regarding scaling laws within cities (in contrast to most work that has focused on scaling laws among cities, though I should note the work of Batty, et al.).  As examples, I will consider recent data from our hometown of Syracuse, NY and also data from Mohenjo Daro of the Indus Valley civilization.  

As you can guess, the Syracuse data was gathered from my service on an IBM Smarter Cities Challenge team, by digging through some old servers held by a not-for-profit partner of the City of Syracuse.  The journal paper on that is finally out, but more importantly it seems to be having some social impact.  Here is a newer video on impacts of what we did there.

The data on Mohenjo Daro is from actual digging, rather than digging through computers.  Built around 2600 BCE, Mohenjo Daro was one of the largest settlements of the ancient Indus Valley Civilization and one of the world’s earliest major urban settlements.  Mohenjo Daro was abandoned in the 19th century BCE, and was not rediscovered until 1922.  The data I will use was initially mapped by British archaeologists in the 1930s in their excavation of Mohenjo Daro, and collected in the paper [Anna Sarcina, “A Statistical Assessment of House Patterns at Moenjo Daro,” Mesopotamia Torino, vol. 13-14, pp. 155-199, 1978.].

Before getting to the data, though, let me describe some theoretical work on the distribution of house sizes from a recent paper of Bettencourt, et al. in a new open access journal from AAAS.  From the settlement scaling theory developed, they make a prediction on the distribution of house areas.  In particular, the overall distribution should be approximately lognormal.  This prediction is borne out in archeological data of houses in pre-Hispanic cities in Mexico.  The basic argument for why the lognormal distribution should arise is from a multiplicative generative process and the central limit theorem.  A reference therein attributes the argument back to William Shockley in studying the productivity of scientists, but according to Mitzenmacher it goes back even further.  (Service times in call centers also appear to be approximately lognormal, among other phenomena.)
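The multiplicative argument is easy to see in simulation (the shock distribution, initial size, and sample sizes here are arbitrary choices): sizes built from many independent multiplicative shocks have approximately normal logarithms.

```python
import numpy as np

rng = np.random.default_rng(42)
# each "house size" is an initial size hit by 50 independent multiplicative
# shocks; log(size) is then a sum of log-shocks, hence approximately normal
# by the central limit theorem, making the sizes approximately lognormal
shocks = rng.uniform(0.8, 1.25, size=(5000, 50))
sizes = 100.0 * shocks.prod(axis=1)

log_sizes = np.log(sizes)
z = (log_sizes - log_sizes.mean()) / log_sizes.std()
skewness = (z ** 3).mean()  # near zero for an approximately normal sample
print(round(skewness, 3))
```

The sizes themselves are strongly right-skewed, but their logs are nearly symmetric, which is exactly the lognormal signature the settlement scaling theory predicts.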

Anyway, coming to our data, let me first show the rank-frequency plot of the surface area (m²) of 183 houses in Mohenjo Daro.

[Figure: rank-frequency plot of house surface areas, Mohenjo Daro]

Now I show the rank-frequency plot of the living area (ft²) of 41,804 houses in Syracuse (data from July 2011).

[Figure: rank-frequency plot of living areas, Syracuse]

What do you think?  Does it look approximately lognormal?  I’ll soon write another blog post with some formal statistical analysis, and some other nuggets from these data sets.

Incidentally, as requested, I seem to be making creativity a part of my research agenda (from an information theory and statistical signal processing perspective).  I spoke about fundamental limits to creativity at the ITA Workshop in San Diego in February (though the talk itself ended up being slightly different from the abstract).  I also organized a special session on computational creativity, which was fun.  

I think someone should connect creativity and cities in some precise informational way, and perhaps you are the man to do it.


Remote Sensing

September 7, 2014

Acquiring information from afar is often quite important, whether to reduce the cost of ground investigation, to get a wider view, or perhaps to conceal surveillance activities.  A couple of weeks ago at the 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, you had a ‘social good’ paper on using remote sensing data to predict which villages have more poor people than others, based on whether there were more houses with metal roofs or with thatch roofs.  An earlier presentation of this work was given at DataKind, under whose auspices the work was carried out together with the charity GiveDirectly.

Congratulations on the best paper award for this work!

Incidentally I also enjoyed your work on tennis analytics at the same conference and was therefore glad I attended part of the Large-Scale Sports Analytics workshop, in addition to the data-driven educational assessment workshop I was running.  

Coming back to remote sensing and somewhat related to the last post, remote sensing of waste production can potentially be used to sense alien civilizations.  Though more apropos to your work, apparently night-time light remote sensing is becoming a common approach to poverty detection, thinking of night lights as light pollution.  A few papers in a variety of journals on this topic include this, this, and this.  I wonder, though, whether there is a way to measure “signal pollution” as a way to do remote sensing to build on the idea of information metabolism.  With information pollution, maybe it is low-entropy signals one should look for, rather than high-entropy signals.
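The low-entropy intuition can be made concrete with a toy sketch (the two binary “signals” here are invented): a periodic, man-made-looking signal has much lower block entropy than a noise-like one, even when their single-symbol distributions are identical.

```python
import math
import random
from collections import Counter

def block_entropy(s: str, k: int = 2) -> float:
    # Shannon entropy (bits) of the empirical distribution of length-k blocks
    blocks = [s[i:i + k] for i in range(len(s) - k + 1)]
    n = len(blocks)
    return -sum(c / n * math.log2(c / n) for c in Counter(blocks).values())

random.seed(0)
structured = "10" * 500                                    # periodic "beacon"
noisy = "".join(random.choice("01") for _ in range(1000))  # noise-like signal

print(round(block_entropy(structured), 2), round(block_entropy(noisy), 2))
```

Note the need for block (rather than single-symbol) entropy: both signals are half zeros and half ones, so only the structure across symbols distinguishes them, which is the kind of “signal pollution” statistic one might look for remotely.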

Perhaps artistic things you can see from the air?
