SDM Stats

May 3, 2018

Hello! The venerable SIAM International Conference on Data Mining (SDM) reconvenes today for its eighteenth edition. The Great AI War of 2018 heads down the Pacific coast. 

Following my previous posts on AISTATS paper counts, ICASSP paper counts, and ICLR paper counts, below are the numbers for accepted SDM papers among companies.

Company Paper Count
Baidu 2
Samsung 2
Adobe 1
Facebook 1
Google 1
LinkedIn 1
NTUC Link 1
PPLive 1
Raytheon 1

My methodology is a manual scan of the printed program.


ICLR Stats

April 27, 2018

Hello bonjour! The venerable International Conference on Learning Representations (ICLR) reconvenes Monday for its sixth edition. The Great AI War of 2018 heads a little west. 

Following my previous posts on AISTATS paper counts broken down by institution and ICASSP paper counts broken down by company, below are the numbers for accepted ICLR main conference papers among the top companies.  As in my ICASSP stats, Google and DeepMind are not treated separately.

Company Paper Count
Google 68
Microsoft 19
Facebook 14
Salesforce 6
Baidu 4
Intel 3

My methodology this time relied on the data compiled by pajoarthur including his logic for converting email addresses to institutions.  However, I aggregated the numbers differently.  He considered ‘Invite to Workshop Track’ status papers in his counts, whereas I did not.  He evaluated the contribution of an author by dividing by the total number of authors of a paper, and then summing up these partial contributions by company; like I did for AISTATS and ICASSP, I counted a paper for a company if it had at least one author from that company.
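The difference between the two aggregation schemes can be sketched in a few lines of Python. This is only an illustration: the toy `papers` list and both function names are hypothetical and are not taken from pajoarthur's data or code.

```python
from collections import defaultdict

# Hypothetical toy data: each paper is the list of its authors' affiliations.
papers = [
    ["Google", "Google", "Stanford"],
    ["Microsoft", "Google"],
    ["Facebook"],
]

def whole_counts(papers):
    """Credit one full paper to every organization with at least one author on it."""
    counts = defaultdict(int)
    for affiliations in papers:
        for org in set(affiliations):  # set() so two co-authors count once
            counts[org] += 1
    return dict(counts)

def fractional_counts(papers):
    """Credit each author 1/(number of authors) of a paper, summed per organization."""
    counts = defaultdict(float)
    for affiliations in papers:
        share = 1.0 / len(affiliations)
        for org in affiliations:
            counts[org] += share
    return dict(counts)

whole = whole_counts(papers)
fractional = fractional_counts(papers)
print(whole)
print(fractional)
```

Fractional shares always sum to the number of papers, whereas whole counts can sum to more, because a paper with authors from several companies is counted once for each of them.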



April 15, 2018

Annyeong! The venerable IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) reconvenes today for its forty-third edition. The Great AI War of 2018 rolls on. 

If you ask how signal processing has become AI, read my recent essay on the topic.

Following my previous post on AISTATS paper counts broken down by institution, I present the numbers for ICASSP below, but only for companies.  This time around, Google and DeepMind are not treated separately.

Company Paper Count
Google 38
NTT 26
Microsoft 25
IBM 16
Huawei 11
Amazon 9
Mitsubishi 9
Samsung 7
Facebook 6
Tencent 6
Alibaba 5
Apple 5
Ericsson 4
Robert Bosch 4
Starkey Hearing Technologies 4
Tata Consultancy Services 4
SRI International 3
Technicolor 3
Toshiba 3
Adobe 2
Analog Devices 2
GE 2
GN 2
Halliburton 2
Hitachi 2
Intel 2
Orange Labs 2
Origin Wireless 2
Qualcomm 2
Raytheon 2
Sony 2
Spotify 2
Thales 2
Toyota 2

I used the official paper index and did not dig deeper into the papers in any way.



April 9, 2018

Hola! The venerable International Conference on Artificial Intelligence and Statistics (AISTATS) reconvenes today for its twenty-first edition. It has recently become common for there to be blog posts presenting the counts of papers at machine learning conferences broken down by institution.  I had not seen it for AISTATS 2018, so I went ahead and put the numbers together.

Institution Paper Count
MIT 13
UC Berkeley 12
Carnegie Mellon 11
Google 11
Stanford 11
Oxford 9
Princeton 9
Texas 8
Duke 7
Cornell 6
DeepMind 6
Harvard 6
Microsoft 6
Tokyo 6
ETH Zurich 5
Georgia Tech 5
Michigan 5
Purdue 5

Since The Great AI War of 2018 is apparently ongoing, here are the numbers for companies.

Institution Paper Count
Google 11
DeepMind 6
Microsoft 6
Adobe 3
Amazon 2
Baidu 1
Charles River Analytics 1
D. E. Shaw 1
Disney 1
Face++ 1
Facebook 1
Mind Foundry 1
Netflix 1
Prowler.io 1
SigOpt 1
Snap 1
Tencent 1
Vicarious 1
Volkswagen 1

In case one is tempted to add the Google and DeepMind numbers together, note that there is one paper in common, so the total is 16, not 17 for the two in combination.
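The combined figure is just the size of a set union. A tiny sketch, in which the paper IDs are hypothetical placeholders and only the sizes (11, 6, and one shared paper) come from the counts above:

```python
# Hypothetical paper IDs; only the set sizes reflect the counts in the post.
google = set(range(1, 12))     # 11 papers: IDs 1..11
deepmind = set(range(11, 17))  # 6 papers: IDs 11..16, sharing ID 11 with Google

shared = google & deepmind
combined = google | deepmind

# |A ∪ B| = |A| + |B| - |A ∩ B|
print(len(google), len(deepmind), len(shared), len(combined))  # 11 6 1 16
```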

Mapping affiliations to institutions is not an exact science. The difficulty is exacerbated by the official accepted papers list here not containing the affiliations of many authors (and in some cases not even being the final list of authors for papers), by there being many ways to refer to the same institution, and by ambiguity about what is a single institution and what is multiple (this is especially difficult for me among French institutions).  I have done my best to find preprints and look at personal websites to fill in and correct institutions given in the accepted papers list. Here is the raw data file that I put together.


Circle of Life

March 18, 2017

Jambo Señor Bernard Lagat!  Greetings from inside the Maasai Mara where I got to see the whole cast of the Circle of Life.  Before coming for my first trip to Africa, my knowledge of the continent, like that of nearly all Americans, was pretty much completely derived from The Lion King.  As I was psyching myself up for the journey, I played a medley of songs from the film along with “Dry Your Tears, Afrika.”

However, once I got here, it was not Africa that was drying its tears, but me, as I explained to a member of the Clinton Health Access Initiative how the kindness of the American people over the last 70 years has led directly to our family’s and my being in the position we are in.  Whether we take Sam Higginbottom and Mason Vaugh’s Allahabad Agricultural Institute that gave Baba his first job and encouraged his growth, the sponsors that allowed him to come to Illinois for higher studies with Bill Perkins not once but twice, the granters of the tuition waivers that allowed Papa to study there himself, the policymakers whose policies encouraged someone like him to work and gain lawful residence in the U.S., the agency program managers who funded Papa’s research, or the people behind the National Science Foundation Graduate Research Fellowship that allowed both of us to thrive in graduate school, the American people have consistently encouraged achieving the dream through hard work and the rewarding of skill, knowledge, and expertise regardless of caste or creed.

But it seems like we’re going through a “cultural revolution” that some attribute to the hollowing out of the middle class and rising income inequality due to increased automation of jobs by technological solutions.  I don’t think anyone should be promoting extreme income inequality, but solutions should come from the science and technologies themselves rather than from crippling advances in the science and technologies that have gotten us to this point.  As Stefano Ermon says, “It’s very important that we make sure that [artificial intelligence] is really for everybody’s benefit.”  Last October I submitted a proposal for an artificial intelligence grand challenge for IBM Research to work on, ultimately not selected, on exactly this topic: reducing economic inequality.  Given all that has transpired in the intervening months, my belief that such a project should be undertaken has only strengthened.

Here in Kenya, I had the pleasure of visiting the startups Soko and mSurvey who are both doing their part in democratizing production and the flow of information.  Both have developed profitable technology-based solutions that happen to push back against inequalities in the developing world.  Back in March 2013, I had submitted a proposal “Production by the Masses” for inclusion in IBM Research’s longer-term vision for the corporation, also not selected, which has some of the elements that these two companies and others like them epitomize.  However, it also failed to fully anticipate some of the things that have taken hold recently like the ridiculously high value of data and the power of the blockchain’s distributed ledger, and over-emphasized the distinction between rural and urban populations.  I now see that the same sort of stuff is needed everywhere there are inequalities, which is everywhere.

Yes there is an ideal inclusive Circle of Life (fragile enough to be Scarred).  Let us all strive for that ideal by valuing knowledge and by using existing and new science and technology.



Flying, flying, digging, digging

October 27, 2015

As you know, I’ve been travelling quite a bit the last month or so.  I think I may have put on more miles per unit time than ever before.  While flying around, I read a good number of popular books that I had been meaning to, in the broad area of information science.  For example, I read Bursts by Albert-László Barabási and learned more about Transylvanian history than I intended.  I also read Social Physics by my one-time collaborator Sandy Pentland, as well as The Life and Work of George Boole: A Prelude to the Digital Age by Desmond MacHale.  I had received this last book as a gift for giving one of the big talks at the When Boole Meets Shannon Workshop at University College Cork in early September.  An extensive biography, it also emphasizes how Boole’s The Laws of Thought makes a strong connection between logic and set theory on the one hand and probability theory on the other, a hundred years before Kolmogorov.  When Boole was reading extracts from the book-in-progress to his wife-to-be Mary Everest, [p. 148]:

She confessed that she felt comforted by the fact that the laws by which the human mind operates were governed by algebraic principles!

Incidentally, in 1868, Mary Boole also wrote a book, The Message of Psychic Science, which had the following rather prescient passage inspired by Babbage’s computer and Jevons’ syllogism evaluator [p. 267]:

Between them they have conclusively proved, by unanswerable logic of facts, that calculation and reasoning, like weaving and ploughing, are work, not for human souls, but for clever combinations of iron and wood.  If you spend time doing work that a machine could do faster than yourselves, it should only be for exercise, as you swing dumb-bells; or for amusement as you dig in your garden; or to soothe your nerves by its mechanicalness, as you take up knitting; not in any hope of so working your way to the truth.

Speaking of iron and wood, one last book I read in my travels is Why Information Grows by Cesar Hidalgo, and a first thing he discusses is how solids are needed to store information.  As he says, close to my heart [p. 34]:

Schrodinger understood that aperiodicity was needed to store information, since a regular crystal would be unable to carry much information.

Let me list my travel venues for you:

And now with that travel done, I think I’ll be going hard on writing and maybe even some theorem-proving and data analytics.  As we’ve discussed, I find blogging to sometimes jump start the writing/doing engines, and so here we go with cities.  

As promised previously, I perform some formal tests for lognormal distributions of house sizes in Mohenjo Daro and in Syracuse. As a starting point, I used the lognfit function in MATLAB to find the maximum likelihood estimates of the fit parameters and also the 95% confidence intervals.  The two parameters are the mean μ and standard deviation σ of the associated normal distribution.  The estimated value of σ is the square root of the unbiased estimate of the variance of the log of the data.  Rather than showing the rank-frequency plots as in the previous post, let me show the cumulative distribution functions.  Note that in the Syracuse data, about 1/5 of houses do not have a listed living area, so I exclude them from this analysis.
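For readers without MATLAB, the point estimates described here can be sketched with the Python standard library. This is a stand-in under stated assumptions, not the lognfit call itself: `lognormal_fit` is a hypothetical helper, the `areas` sample is synthetic rather than the Mohenjo Daro or Syracuse data, and no confidence intervals are computed.

```python
import math
import random
import statistics

def lognormal_fit(data):
    """Return (mu, sigma) of the normal distribution associated with log(data).

    sigma is the square root of the unbiased (n - 1 denominator) variance
    estimate of the logged data, matching the description in the text.
    """
    logs = [math.log(x) for x in data]
    mu = statistics.fmean(logs)
    sigma = statistics.stdev(logs)  # stdev() uses the n - 1 denominator
    return mu, sigma

# Synthetic stand-in for house areas; the true parameters are arbitrary choices.
rng = random.Random(0)
areas = [rng.lognormvariate(4.5, 0.6) for _ in range(5000)]

mu_hat, sigma_hat = lognormal_fit(areas)
print(f"mu_hat = {mu_hat:.3f}, sigma_hat = {sigma_hat:.3f}")
```

With a sample this large, both estimates should land close to the parameters used to generate the data.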



At least visually, these don’t look like the best of fits.  To measure the goodness of fit, I use the chi-square goodness-of-fit test as implemented in MATLAB as chi2gof.  With data ‘area’ already fit using lognfit into parameter vector ‘parmhat’, this is [h,p] = chi2gof(area,'cdf',@(z)logncdf(z,parmhat(1),parmhat(2)),'nparams',2). Despite the visual evidence, the chi-square test does not reject the null hypothesis of lognormality at the 5% significance level for Mohenjo Daro.  The chi-square test does reject the null hypothesis of lognormality at the 5% significance level for Syracuse, contrary to the theory of Bettencourt, et al.  I wonder what the explanation might be for this contrary finding in Syracuse: maybe some data fidelity issues?
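A rough, self-contained Python sketch of the same kind of test follows. It is not a reimplementation of chi2gof: it bins the data into equal-probability bins under the fitted distribution (an assumption of this sketch; chi2gof's own binning may differ), it runs on synthetic samples, and the helper names are hypothetical. To stay within the standard library, the p-value uses the closed-form chi-square tail that exists only for even degrees of freedom, which is why nine bins are used (9 - 1 - 2 = 6).

```python
import math
import random
from statistics import NormalDist, fmean, stdev

def chi2_sf_even(x, df):
    """Upper-tail probability of a chi-square variable (closed form, even df only)."""
    if df <= 0 or df % 2:
        raise ValueError("this sketch handles positive even degrees of freedom only")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total

def lognormal_chi2_gof(data, nbins=9, alpha=0.05):
    """Chi-square test of lognormality; returns (reject, p_value, statistic)."""
    logs = [math.log(x) for x in data]
    fitted = NormalDist(fmean(logs), stdev(logs))
    # Interior bin edges on the log scale; each bin has probability 1/nbins under the fit.
    edges = [fitted.inv_cdf(k / nbins) for k in range(1, nbins)]
    observed = [0] * nbins
    for v in logs:
        observed[sum(v > e for e in edges)] += 1
    expected = len(logs) / nbins
    stat = sum((o - expected) ** 2 / expected for o in observed)
    df = nbins - 1 - 2  # two parameters (mu, sigma) were estimated from the data
    p_value = chi2_sf_even(stat, df)
    return p_value < alpha, p_value, stat

rng = random.Random(0)
lognormal_sample = [rng.lognormvariate(4.5, 0.6) for _ in range(2000)]
uniform_sample = [rng.uniform(1.0, 100.0) for _ in range(2000)]

print(lognormal_chi2_gof(lognormal_sample))
print(lognormal_chi2_gof(uniform_sample))
```

The first element of each result plays the role of h in the MATLAB call, and the second the role of p. The uniform sample is far from lognormal, so the test should reject it decisively; for the lognormal sample, a rejection at the 5% level is expected only about one time in twenty.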

By the way, I also promised some other nuggets and so here is one: the relationship between living area and value in Syracuse.


There is certainly more than just living area that determines value.  In fact, the methodology of assessing house value is an interesting one.  One more nugget is on when houses that existed in Syracuse in July 2011 were built.


I wonder if there is a way to understand this data through a birth-death process model.  There is a nice theoretical paper in this general direction, “Random Fluctuations in the Age-Distribution of a Population Whose Development is Controlled by the Simple “Birth-and-Death” Process,” by David G. Kendall from the J. Royal Statistical Society in 1950.

To close the story with more birth and death, unlike studying Mohenjo Daro, on which there is little information, the difficulty for future historians will certainly be too much data. Before the travels, I finished reading through a popular book, Dataclysm, by the author of the OKCupid blog; in a sense it is an expanded version of that blog. One of the big things that is pointed out is that there will be growing longitudinal data about individuals due to social media such as Facebook.  Collections of pennants are eventually taken down from bedroom walls, but nothing is taken down from Facebook walls. The book uses culturomics.

As I may have foreshadowed, Ron Kline’s book, The Cybernetic Moment (that I helped with a little bit), also uses culturomics a little bit to measure the nature of discourse.

So that is some flying, flying, and digging, digging from me.  Hope you’ll contribute to the discourse so future historians have more to study.  By the way, the city sizes for the various places (as per Wikipedia today) are, from large to small:

  • 13,216,221 – Tokyo
  • 2,722,389 – Chicago
  • 435,413 – Jeju
  • 84,513 – Champaign
  • 67,947 – Santa Fe
  • 41,250 – Urbana
  • 12,019 – Los Alamos
  • 5,138 – Monticello

Perhaps data for a statistical assessment?


Digging into a city

March 26, 2015

Glad to see your work described previously now appearing in a journal paper, but also glad to know that it is doing social good.  One of the main things you did was look for buildings from satellite imagery, which is really quite a neat thing.  As you know, I have been quite intrigued by the science of cities, and perhaps data from satellite imagery can be useful to make empirical statements there.  Can one see municipal waste remotely?  In anticipation of that, perhaps I can dig through some data on cities that I happen to have, and see if there are interesting statements to be made regarding scaling laws within cities (in contrast to most work that has focused on scaling laws among cities, though I should note the work of Batty, et al.).  As examples, I will consider recent data from our hometown of Syracuse, NY and also data from Mohenjo Daro of the Indus Valley civilization.  

As you can guess, the Syracuse data was gathered from my service on an IBM Smarter Cities Challenge team, by digging through some old servers held by a not-for-profit partner of the City of Syracuse.  The journal paper on that is finally out, but more importantly it seems to be having some social impact.  Here is a newer video on impacts of what we did there.

The data on Mohenjo Daro is from actual digging, rather than digging through computers.  Built around 2600 BCE, Mohenjo Daro was one of the largest settlements of the ancient Indus Valley Civilization and one of the world’s earliest major urban settlements.  Mohenjo Daro was abandoned in the 19th century BCE, and was not rediscovered until 1922.  The data I will use was initially mapped by British archaeologists in the 1930s in their excavation of Mohenjo Daro, and collected in the paper [Anna Sarcina, “A Statistical Assessment of House Patterns at Moenjo Daro,” Mesopotamia Torino, vol. 13-14, pp. 155-199, 1978.].

Before getting to the data, though, let me describe some theoretical work on the distribution of house sizes from a recent paper of Bettencourt, et al. in a new open access journal from AAAS.  From the settlement scaling theory developed, they make a prediction on the distribution of house areas.  In particular, the overall distribution should be approximately lognormal.  This prediction is borne out in archeological data of houses in pre-Hispanic cities in Mexico.  The basic argument for why the lognormal distribution should arise is from a multiplicative generative process and the central limit theorem.  A reference therein attributes the argument back to William Shockley in studying the productivity of scientists, but according to Mitzenmacher it goes back even further.  (Service times in call centers also appear to be approximately lognormal, among other phenomena.)
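The multiplicative argument is easy to see in a small simulation. If a size is the product of many independent positive factors, its logarithm is a sum of independent terms, so by the central limit theorem the logarithm is approximately normal and the size itself approximately lognormal. The sketch below is only an illustration: the factor range, step count, and function names are arbitrary assumptions and are not taken from Bettencourt, et al.

```python
import math
import random
from statistics import fmean, stdev

def simulate_sizes(n_items, n_steps, rng):
    """Grow each size from 1.0 by multiplying n_steps independent random factors."""
    sizes = []
    for _ in range(n_items):
        size = 1.0
        for _ in range(n_steps):
            size *= rng.uniform(0.8, 1.25)  # hypothetical growth/shrink factor
        sizes.append(size)
    return sizes

def skewness(values):
    """Sample skewness: mean cubed z-score (near 0 for a symmetric distribution)."""
    m, s = fmean(values), stdev(values)
    return fmean([((v - m) / s) ** 3 for v in values])

rng = random.Random(1)
sizes = simulate_sizes(n_items=4000, n_steps=50, rng=rng)
log_sizes = [math.log(s) for s in sizes]

skew_raw = skewness(sizes)
skew_log = skewness(log_sizes)
print(f"skewness of sizes: {skew_raw:.2f}, of log sizes: {skew_log:.2f}")
```

The raw sizes should come out strongly right-skewed, while their logarithms should be close to symmetric, which is the signature of an approximately lognormal distribution.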

Anyway, coming to our data, let me first show the rank-frequency plot of the surface area (m²) of 183 houses in Mohenjo Daro.

Mohenjo Daro

Now I show the rank-frequency plot of the living area (ft²) of 41804 houses in Syracuse (data from July 2011).

Syracuse
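For anyone wanting to reproduce this kind of plot, the underlying data is simply the values sorted from largest to smallest and paired with their ranks. A minimal sketch, where the `areas` values are hypothetical placeholders rather than figures from either data set:

```python
def rank_frequency(values):
    """Return (rank, value) pairs, with rank 1 assigned to the largest value."""
    ordered = sorted(values, reverse=True)
    return list(enumerate(ordered, start=1))

# Hypothetical house areas, for illustration only.
areas = [56.0, 130.5, 97.2, 40.0, 183.0]

pairs = rank_frequency(areas)
for rank, area in pairs:
    print(rank, area)
```

Such pairs are often drawn on logarithmic axes, which makes heavy-tailed behaviour easier to judge by eye.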

What do you think?  Does it look approximately lognormal?  I’ll soon write another blog post with some formal statistical analysis, and some other nuggets from these data sets.

Incidentally, as requested, I seem to be making creativity a part of my research agenda (from an information theory and statistical signal processing perspective).  I spoke about fundamental limits to creativity at the ITA Workshop in San Diego in February (though the talk itself ended up being slightly different than the abstract).  I also organized a special session on computational creativity, which was fun.  

I think someone should connect creativity and cities in some precise informational way, and perhaps you are the man to do it.