The Pandemic Bandwagon

March 21, 2020

A Wuhan-shake to you señor. Hope you’re doing alright with the shelter-in-place order for Santa Clara County and California. We’re on pause here in New York.

Yesterday morning, I was all psyched up to do a blog post and accompanying twitter thread on 12 Data Science Problems in Mitigating and Managing Pandemics Like COVID-19 that would go through several issues related to the crisis that have some avenue for data science (broadly construed) to contribute to. I’ve been glued to twitter the last few evenings and a lot of different people have been posting various things. I have things to share, I thought, so why not me?

A wise person asked me to reflect on whether it would be a sensible thing to do. She emphasized that “there are so many people who are jumping on the bandwagon trying to help. Some mean well while some are capitalizing on the situation. And of those that mean well, some are offering silly things.” As you’ve told me on occasion, Shannon was wary of the bandwagon as well, and much preferred the “slow tedious process of hypothesis and experimental validation.” He noted that “a few first rate research papers are preferable to a large number that are poorly conceived or half-finished.” What would he have said to streams of consciousness offered up 280 characters at a time? Adam Rogers wrote yesterday afternoon that “the chatter about a promising drug to fight Covid-19 started, as chatter often does (but science does not), on Twitter.”

I woke up this morning wishing for science, not chatter. I realized that I am not among “men ages 24–36 working in tech” predisposed to “armchair epidemiology.” I turned 37 a whopping five months ago!

Rogers continued: “Silicon Valley lionizes people who rush toward solutions and ignore problems; science is designed to find solutions by identifying those problems.”

So let’s talk about problems and how run-of-the-mill data scientists working in isolation, both literally and figuratively, usually lack the requisite problem understanding to make the right contribution.

In dealing with global disease outbreaks, such as the ongoing novel coronavirus pandemic, we can imagine four main opportunities to help: surveillance, testing, management, and cure. We are primarily concerned with zoonotic diseases: diseases that transfer from animals to humans.  By surveilling, we mean tools and techniques for predicting or providing early warnings of outbreaks of novel or known pathogens. By testing, we mean diagnosing individual patients with the disease. By managing, we mean the tools and techniques for better understanding and limiting the spread of the outbreak, providing care, and engaging the citizenry.  By curing, we mean the development of therapeutic agents to administer to infected individuals. In all of these areas, the lone data scientist working without true problem understanding can be misguided at best and detrimental at worst.


  1. Zoonotic pathogen prediction. There are a large number of known pathogens, but for most of them, it is not known whether they can transfer from animals into humans (and develop into outbreaks). It may be possible to predict the likely candidates by training on features of known zoonotic pathogens. We tried doing it a few years ago in partnership with disease ecologist Barbara Han who defined the relevant features, but didn’t get very far because the features of pathogens are not available in a nice clean tabular dataset; they are locked up inside scientific publications. Automatically extracting knowledge from these very specialized documents requires a large number of documents annotated by expert ecologists, which is not tenable. Even if we were able to pull together a dataset suitable for predicting zoonoses, we wouldn’t know how to make heads or tails of the results without the disease ecologists.
  2. Informed spillover surveillance. Once a pathogen is known as a zoonotic disease and has had an outbreak, it is important to monitor it for future outbreaks or spillovers. Reservoir species harbor pathogens without having symptoms and without dying, waiting for a vector to carry the disease to humans and start another outbreak. In the first year of the IBM Science for Social Good initiative, we partnered with the same disease ecologist to develop algorithms for predicting the reservoir species of primates for Zika virus in the Americas so that populations of those species could be further scrutinized and monitored. Without Barbara, we would have had no clue about what problem to solve, what data sources to trust, how to overcome severe class imbalance in the prediction task (by combining data from other viruses in the same family), and how the predictions could inform policy.
  3. Outbreak early warning. The earlier we know that an outbreak is starting, the earlier actions can be taken to contain it. There are often small signals in various reports and other data that indicate a disaster is beginning. BlueDot knew something was up with the novel coronavirus as early as December 30, 2019, but they’ve been at this for quite a while and have a team that includes veterinarians, doctors, and epidemiologists. Even then, their warnings were not heeded as strongly as they could have been.


  1. Group testing. There are shortages of COVID-19 tests in certain places. Well-meaning data scientists ask the question: isn’t there a smart way to test more people with the same number of tests? (I’ve seen it asked several different times already, including in an email that a friend from grad school sent both of us.) Eventually, someone points out the method of group testing, which has been known since WWII. But even that is not the solution for the current method of testing (PCR). You pointed out in your response to the friend that group testing would require a serological test for COVID-19, which isn’t ready yet. It is a case of solving a problem with an already known solution that is actually not a relevant problem.
  2. Deep learning from CT images. Deep neural networks have achieved better accuracy than expert physicians in several medical imaging tasks in radiology, dermatology, ophthalmology, and pathology, so it is natural that several groups would try training them for diagnosing COVID-19. Again, a well-meaning effort, but sometimes not executed very well. E.g. this paper uses CT images of COVID-19-confirmed patients from China as the positive class and images of healthy people from the United States as the negative class — which may introduce spurious correlations and artificially inflate the accuracy.  Even if this task is well done, will it find its way into clinical practice?  That has not yet been the case in those tasks mentioned above despite the initial demonstrations having happened several years ago.
  3. Classifying breathing patterns. A paper posted to arXiv with the title Abnormal respiratory patterns classifier may contribute to large-scale screening of people infected with COVID-19 in an accurate and unobtrusive manner claims that “According to the latest clinical research, the respiratory pattern of COVID-19 is different from the respiratory patterns of flu and the common cold. One significant symptom that occurs in the COVID-19 is Tachypnea. People infected with COVID-19 have more rapid respiration.” but the authors provide no reference to this clinical research and I haven’t been able to track it down myself. If there isn’t really any distinguishing difference between respiration patterns with flu and COVID-19, then this work is in vain, and could have been avoided by conferring with clinicians.
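
For readers who have not met the group testing idea from item 1, here is a minimal sketch of the classic two-stage pooling scheme from the WWII era: test a pooled sample first, and retest each member individually only if the pool is positive. It assumes a perfect test and independent infections, and the prevalence values are purely illustrative; the function names are mine. It says nothing about whether pooling is workable for any particular assay, which is exactly the sort of question that needs a lab expert.

```python
def expected_tests_per_person(prevalence: float, pool_size: int) -> float:
    """Expected number of tests per person under two-stage pooling.

    Assumes a perfect test and independent infections (both idealizations).
    """
    if pool_size == 1:
        # A "pool" of one is just individual testing.
        return 1.0
    # Probability that nobody in the pool is infected.
    p_pool_negative = (1.0 - prevalence) ** pool_size
    # One pooled test shared by the group, plus one individual retest
    # per person whenever the pool comes back positive.
    return 1.0 / pool_size + (1.0 - p_pool_negative)


def best_pool_size(prevalence: float, max_pool_size: int = 50) -> int:
    """Pool size (1..max_pool_size) that minimizes expected tests per person."""
    return min(
        range(1, max_pool_size + 1),
        key=lambda k: expected_tests_per_person(prevalence, k),
    )


if __name__ == "__main__":
    for prevalence in (0.01, 0.05, 0.20, 0.50):  # hypothetical prevalences
        k = best_pool_size(prevalence)
        cost = expected_tests_per_person(prevalence, k)
        print(f"prevalence={prevalence:.2f}  best pool size={k}  tests/person={cost:.3f}")
```

The savings shrink quickly as prevalence rises, and at high prevalence the best "pool" is a single person, so even in the idealized setting the trick only helps when the disease is rare among those tested.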


  1. Spatiotemporal epidemiological modeling. Once an outbreak has started, it is important to model its spread to inform decision making in the response. This is the purview of epidemiology and has a lot of nuance to it. Small differences in the input can yield large differences in the output. This should be left to the experts who have been doing it for many years.
  2. Data-driven decision making. Another aspect to managing an outbreak is collecting primary (e.g. case counts), secondary (e.g. hospital beds, personal protective equipment), and tertiary (e.g. transportation and other infrastructure) information. This is highly challenging and in a disaster situation requires both formal and informal means. During the 2014 Ebola outbreak, we observed that there was a lot of enthusiasm for collecting, collating, and visualizing the case counts, but not so much for the secondary and tertiary information, which, according to the true experts, is really the most important for managing the situation. The same focus on the former is true now, but at least there is some focus on the latter. Enthusiasm is great, but better when directed to the important problems.
  3. Engaging the public. In managing outbreaks, it is critical to inform the public of best practices to limit the person-to-person spread of the disease (which may go against cultural norms) and also to receive information from the situation on the ground. This has been done to good effect in the past, such as during the Ebola outbreak, and in certain places now, but seems to be lacking in many other places. Misinformation and disinformation on peer-to-peer and social network platforms appear to be rampant, but there seems to be little ‘tech solutioning’ in this space so far – perhaps the energy is being spent elsewhere.
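
To make the sensitivity remark in item 1 concrete, here is a toy sketch of the textbook SIR compartmental model, integrated with simple Euler steps. Everything in it is an assumption chosen for illustration (population size, infectious period, the reproduction numbers, the function name); it is not a forecast of anything and is no substitute for the nuance that real epidemiological modeling carries.

```python
def simulate_sir(
    r0: float,
    infectious_days: float = 5.0,   # hypothetical mean infectious period
    population: float = 1_000_000,  # hypothetical closed population
    initial_infected: float = 10.0,
    days: int = 365,
    steps_per_day: int = 10,
):
    """Return (peak infected, final susceptible, final infected, final recovered)."""
    gamma = 1.0 / infectious_days   # recovery rate per day
    beta = r0 * gamma               # transmission rate per day
    dt = 1.0 / steps_per_day

    s = population - initial_infected
    i = initial_infected
    r = 0.0
    peak = i
    for _ in range(days * steps_per_day):
        new_infections = beta * s * i / population * dt
        new_recoveries = gamma * i * dt
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        peak = max(peak, i)
    return peak, s, i, r


if __name__ == "__main__":
    # Nudge the assumed reproduction number slightly and compare the peaks.
    for r0 in (1.3, 1.4, 1.5):
        peak, *_ = simulate_sir(r0)
        print(f"R0={r0:.1f}  peak simultaneously infected={peak:,.0f}")
```

Even in this toy, a modest change in one assumed input moves the peak a great deal, and real models have many more inputs, most of them uncertain.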


  1. Drug repurposing. Interestingly, drugs developed for particular diseases also have therapeutic effect on other diseases. For example, chloroquine, an old malaria drug, has an effect on certain cancers and anecdotally seems to show an effect on the novel coronavirus. By finding such old generic drugs whose safety has already been tested and which might be inexpensive and already in large supply, we can quickly start tamping down an outbreak after the therapeutic effect is confirmed in a large-scale clinical trial. But such findings of repurposing are difficult to notice at large scale without the use of natural language processing of scientific publications. A consortium recently released a collection of 29,000 scientific publications related to COVID-19 (CORD-19), but there is very little guidance for NLP researchers on what to do with that data and no subject matter expert support. Therefore, it seems unlikely that anything of much use will come out of it.
  2. Novel drug generation and discovery. Repurposing has its limits; we must also discover completely new drugs for new diseases. State-of-the-art generative modeling approaches have begun that journey, but are currently difficult to control. And moreover, consulting subject matter experts is required to figure out what desirable properties to control for in the generation: things like toxicity and solubility. Finally, generating sequences of candidate drugs in silico only makes sense if there is close coupling with laboratories that can actually synthesize and test the candidates.

In my originally envisioned post, I was going to end with a sort of cute twelfth item: staying at home.  Apart from lumberjacks, data scientists are among the professions most suited to not spreading the coronavirus according to this data presented by the New York Times. But in fact, this is not merely a cute conclusion: it is the one right contribution that data scientists can truly make well while in isolation off the bandwagon. When the fog clears, however, let’s be deliberate and work interdisciplinarily to create full, well thought out, and tested solutions for mitigating and managing global pandemics.



June 3, 2018

Comment ça se plume? The venerable Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL) reconvenes this week. The Great AI War of 2018 revisits New Orleans for another skirmish. 

Following my previous posts on AISTATS paper counts, ICASSP paper counts, ICLR paper counts, and SDM paper counts, below are the numbers for accepted NAACL papers among companies for long papers, short papers, industry papers, and all combined.

Company Paper Count (Long)
Microsoft 10
Amazon 5
Facebook 5
Tencent 4
DeepMind 3
Google 3
JD 3
Adobe 2
Elemental Cognition 2
PolyAI 2
Siemens 2
Agolo 1
Aylien 1
Bloomberg 1
Bytedance 1
Choosito 1
Data Cowboys 1
Educational Testing Service 1
Fuji Xerox 1
Grammarly 1
Huawei 1
Improva 1
Interactions 1
Intuit 1
Philips 1
Samsung 1
Snap 1
Synyi 1
Thomson Reuters 1
Tricorn (Beijing) Technology 1

Company Paper Count (Short)
Google 3
Microsoft 3
Facebook 2
Adobe 1
Alibaba 1
Amazon 1
Ant Financial Services 1
Bloomberg 1
Educational Testing Service 1
Infosys 1
PolyAI 1
Preferred Networks 1
Roam Analytics 1
Robert Bosch 1
Samsung 1
Tencent 1
Thomson Reuters 1
Volkswagen 1

Company Paper Count (Industry)
Amazon 6
eBay 4
Airbnb 1
Boeing 1
Clinc 1
Educational Testing Service 1
Google 1
Interactions 1
Microsoft 1
Nuance 1
ZEIT online 1

Company Paper Count (Total)
Microsoft 14
Amazon 12
IBM 10
Facebook 7
Google 7
Tencent 5
eBay 4
Adobe 3
DeepMind 3
Educational Testing Service 3
JD 3
PolyAI 3
Bloomberg 2
Elemental Cognition 2
Interactions 2
Samsung 2
Siemens 2
Thomson Reuters 2

My methodology was to click on all the pdfs in the proceedings and manually note affiliations.


SDM Stats

May 3, 2018

Hello! The venerable SIAM International Conference on Data Mining (SDM) reconvenes today for its eighteenth edition. The Great AI War of 2018 heads down the Pacific coast. 

Following my previous posts on AISTATS paper counts, ICASSP paper counts, and ICLR paper counts, below are the numbers for accepted SDM papers among companies.

Company Paper Count
Baidu 2
Samsung 2
Adobe 1
Facebook 1
Google 1
LinkedIn 1
NTUC Link 1
PPLive 1
Raytheon 1

My methodology is a manual scan of the printed program.


ICLR Stats

April 27, 2018

Hello bonjour! The venerable International Conference on Learning Representations (ICLR) reconvenes Monday for its sixth edition. The Great AI War of 2018 heads a little west. 

Following my previous posts on AISTATS paper counts broken down by institution and ICASSP paper counts broken down by company, below are the numbers for accepted ICLR main conference papers among the top companies.  Like in my ICASSP stats, Google and DeepMind are not treated separately.

Company Paper Count
Google 68
Microsoft 19
Facebook 14
Salesforce 6
Baidu 4
Intel 3

My methodology this time relied on the data compiled by pajoarthur including his logic for converting email addresses to institutions.  However, I aggregated the numbers differently.  He considered ‘Invite to Workshop Track’ status papers in his counts, whereas I did not.  He evaluated the contribution of an author by dividing by the total number of authors of a paper, and then summing up these partial contributions by company; like I did for AISTATS and ICASSP, I counted a paper for a company if it had at least one author from that company.
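
The difference between the two counting schemes described above is easy to see in code. This is only a sketch of the idea, not pajoarthur's actual script or my spreadsheet; the function names are mine and the three toy papers with their company names are made up.

```python
from collections import defaultdict


def fractional_counts(papers):
    """Each author contributes 1/len(authors) of a paper to their company."""
    totals = defaultdict(float)
    for author_companies in papers:
        share = 1.0 / len(author_companies)
        for company in author_companies:
            totals[company] += share
    return dict(totals)


def at_least_one_author_counts(papers):
    """A paper counts once for every company with at least one author on it."""
    totals = defaultdict(int)
    for author_companies in papers:
        for company in set(author_companies):
            totals[company] += 1
    return dict(totals)


# Hypothetical data: one list per paper, one company name per author.
toy_papers = [
    ["Acme", "Acme", "Globex"],
    ["Initech", "Acme"],
    ["Globex"],
]

if __name__ == "__main__":
    print(fractional_counts(toy_papers))
    print(at_least_one_author_counts(toy_papers))
```

Under the at-least-one-author rule, a paper with authors from several companies is counted in full for each of them, so the per-company numbers can add up to more than the number of papers; under the fractional rule they cannot.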



April 15, 2018

Annyeong! The venerable IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) reconvenes today for its forty-third edition. The Great AI War of 2018 rolls on. 

If you ask how signal processing has become AI, read my recent essay on the topic.

Following my previous post on AISTATS paper counts broken down by institution, I present the numbers for ICASSP below, but only for companies.  This time around, Google and DeepMind are not treated separately.

Company Paper Count
Google 38
NTT 26
Microsoft 25
IBM 16
Huawei 11
Amazon 9
Mitsubishi 9
Samsung 7
Facebook 6
Tencent 6
Alibaba 5
Apple 5
Ericsson 4
Robert Bosch 4
Starkey Hearing Technologies 4
Tata Consultancy Services 4
SRI International 3
Technicolor 3
Toshiba 3
Adobe 2
Analog Devices 2
GE 2
GN 2
Halliburton 2
Hitachi 2
Intel 2
Orange Labs 2
Origin Wireless 2
Qualcomm 2
Raytheon 2
Sony 2
Spotify 2
Thales 2
Toyota 2

I used the official paper index and did not dig deeper into the papers in any way.



April 9, 2018

Hola! The venerable International Conference on Artificial Intelligence and Statistics (AISTATS) reconvenes today for its twenty-first edition. It has recently become common for there to be blog posts presenting the counts of papers at machine learning conferences broken down by institution.  I had not seen it for AISTATS 2018, so I went ahead and put the numbers together.

Institution Paper Count
MIT 13
UC Berkeley 12
Carnegie Mellon 11
Google 11
Stanford 11
Oxford 9
Princeton 9
Texas 8
Duke 7
Cornell 6
DeepMind 6
Harvard 6
Microsoft 6
Tokyo 6
ETH Zurich 5
Georgia Tech 5
Michigan 5
Purdue 5

Since The Great AI War of 2018 is apparently ongoing, here are the numbers for companies.

Institution Paper Count
Google 11
DeepMind 6
Microsoft 6
Adobe 3
Amazon 2
Baidu 1
Charles River Analytics 1
D. E. Shaw 1
Disney 1
Face++ 1
Facebook 1
Mind Foundry 1
Netflix 1
Prowler.io 1
SigOpt 1
Snap 1
Tencent 1
Vicarious 1
Volkswagen 1

In case one is tempted to add the Google and DeepMind numbers together, note that there is one paper in common, so the total is 16, not 17 for the two in combination.

Mapping affiliations to institutions is not an exact science, and the task is made harder by the official accepted papers list not containing the affiliations of many authors (and in some cases not even being the final list of authors for papers), there being many ways to refer to the same institution, and there being ambiguity about what is a single institution and what is multiple (this is especially difficult for me among French institutions).  I have done my best to find preprints and look at personal websites to fill in and correct institutions given in the accepted papers list. Here is the raw data file that I put together.


Circle of Life

March 18, 2017

Jambo Señor Bernard Lagat!  Greetings from inside the Maasai Mara where I got to see the whole cast of the Circle of Life.  Before coming for my first trip to Africa, my knowledge of the continent, like nearly all Americans, was pretty much completely derived from The Lion King.  As I was psyching myself up for the journey, I played a medley of songs from the film along with “Dry Your Tears, Afrika.”

However, once I got here, it was not Africa that was drying its tears, but me, as I explained to a member of the Clinton Health Access Initiative how the kindness of the American people over the last 70 years has led directly to our family’s and my being in the position we are in.  Whether we take Sam Higginbottom and Mason Vaugh’s Allahabad Agricultural Institute that gave Baba his first job and encouraged his growth, the sponsors that allowed him to come to Illinois for higher studies with Bill Perkins not once but twice, the granters of the tuition waivers that allowed Papa to study there himself, the policymakers whose policies encouraged someone like him to work and gain lawful residence in the U.S., the agency program managers who funded Papa’s research, or the people behind the National Science Foundation Graduate Research Fellowship that allowed both of us to thrive in graduate school, the American people have consistently encouraged achieving the dream through hard work and the rewarding of skill, knowledge, and expertise regardless of caste or creed.

But it seems like we’re going through a “cultural revolution” that some attribute to the hollowing out of the middle class and rising income inequality due to increased automation of jobs by technological solutions.  I don’t think anyone should be promoting extreme income inequality, but solutions should come from the science and technologies themselves rather than from crippling advances in the science and technologies that have gotten us to this point.  As Stefano Ermon says, “It’s very important that we make sure that [artificial intelligence] is really for everybody’s benefit.”  Last October I submitted a proposal for an artificial intelligence grand challenge for IBM Research to work on, ultimately not selected, on exactly this topic: reducing economic inequality.  Given all that has transpired in the intervening months, my belief that such a project should be undertaken has only strengthened.

Here in Kenya, I had the pleasure of visiting the startups Soko and mSurvey, which are both doing their part in democratizing production and the flow of information.  Both have developed profitable technology-based solutions that happen to push back against inequalities in the developing world.  Back in March 2013, I had submitted a proposal “Production by the Masses” for inclusion in IBM Research’s longer-term vision for the corporation, also not selected, which has some of the elements that these two companies and others like them epitomize.  However, it also failed to fully anticipate some of the things that have taken hold recently like the ridiculously high value of data and the power of the blockchain’s distributed ledger, and over-emphasized the distinction between rural and urban populations.  I now see that the same sort of stuff is needed everywhere there are inequalities, which is everywhere.

Yes there is an ideal inclusive Circle of Life (fragile enough to be Scarred).  Let us all strive for that ideal by valuing knowledge and by using existing and new science and technology.