The Pandemic Bandwagon

March 21, 2020

A Wuhan-shake to you señor. Hope you’re doing alright with the shelter-in-place order for Santa Clara County and California. We’re on pause here in New York.

Yesterday morning, I was all psyched up to do a blog post and accompanying Twitter thread on 12 Data Science Problems in Mitigating and Managing Pandemics Like COVID-19, which would go through several issues related to the crisis where data science (broadly construed) has some avenue to contribute. I’ve been glued to Twitter the last few evenings and a lot of different people have been posting various things. I have things to share, I thought, so why not me?

A wise person asked me to reflect on whether it would be a sensible thing to do. She emphasized that “there are so many people who are jumping on the bandwagon trying to help. Some mean well while some are capitalizing on the situation. And of those that mean well, some are offering silly things.” As you’ve told me on occasion, Shannon was wary of the bandwagon as well, and much preferred the “slow tedious process of hypothesis and experimental validation.” He noted that “a few first rate research papers are preferable to a large number that are poorly conceived or half-finished.” What would he have said to streams of consciousness offered up 280 characters at a time? Adam Rogers wrote yesterday afternoon that “the chatter about a promising drug to fight Covid-19 started, as chatter often does (but science does not), on Twitter.”

I woke up this morning wishing for science, not chatter. I realized that I am not among “men ages 24–36 working in tech” predisposed to “armchair epidemiology.” I turned 37 a whopping five months ago!

Rogers continued: “Silicon Valley lionizes people who rush toward solutions and ignore problems; science is designed to find solutions by identifying those problems.”

So let’s talk about problems and how run-of-the-mill data scientists working in isolation, both literally and figuratively, usually lack the requisite problem understanding to make the right contribution.

In dealing with global disease outbreaks, such as the ongoing novel coronavirus pandemic, we can imagine four main opportunities to help: surveillance, testing, management, and cure. We are primarily concerned with zoonotic diseases: diseases that transfer from animals to humans.  By surveilling, we mean tools and techniques for predicting or providing early warnings of outbreaks of novel or known pathogens. By testing, we mean diagnosing individual patients with the disease. By managing, we mean the tools and techniques for better understanding and limiting the spread of the outbreak, providing care, and engaging the citizenry.  By curing, we mean the development of therapeutic agents to administer to infected individuals. In all of these areas, the lone data scientist working without true problem understanding can be misguided at best and detrimental at worst.

Surveillance

  1. Zoonotic pathogen prediction. There are a large number of known pathogens, but for most of them, it is not known whether they can transfer from animals into humans (and develop into outbreaks). It may be possible to predict the likely candidates by training on features of known zoonotic pathogens. We tried doing it a few years ago in partnership with disease ecologist Barbara Han, who defined the relevant features, but didn’t get very far because the features of pathogens are not available in a nice clean tabular dataset; they are locked up inside scientific publications. Automatically extracting knowledge from these very specialized documents requires a lot of documents annotated by expert ecologists, which is not tenable. Even if we were able to pull together a dataset suitable for predicting zoonoses, we wouldn’t know how to make heads or tails of the results without the disease ecologists.
  2. Informed spillover surveillance. Once a pathogen is known as a zoonotic disease and has had an outbreak, it is important to monitor it for future outbreaks or spillovers. Reservoir species harbor pathogens without having symptoms and without dying, waiting for a vector to carry the disease to humans and start another outbreak. In the first year of the IBM Science for Social Good initiative, we partnered with the same disease ecologist to develop algorithms for predicting the reservoir species of primates for Zika virus in the Americas so that populations of those species could be further scrutinized and monitored. Without Barbara, we would have had no clue about what problem to solve, what data sources to trust, how to overcome severe class imbalance in the prediction task (by combining data from other viruses in the same family), and how the predictions could inform policy.
  3. Outbreak early warning. The earlier we know that an outbreak is starting, the earlier actions can be taken to contain it. There are often small signals in various reports and other data that indicate a disaster is beginning. BlueDot knew something was up with the novel coronavirus as early as December 30, 2019, but they’ve been at this for quite a while and have a team that includes veterinarians, doctors, and epidemiologists. Even then, their warnings were not heeded as strongly as they could have been.

Testing

  1. Group testing. There are shortages of COVID-19 tests in certain places. Well-meaning data scientists ask the question: isn’t there a smart way to test more people with the same number of tests? (I’ve seen it asked several different times already, including in an email that a friend from grad school sent both of us.) Eventually, someone points out the method of group testing, which has been known since WWII. But even that is not the solution for the current method of testing (PCR). You pointed out in your response to the friend that group testing would require a serological test for COVID-19, which isn’t ready yet. This is a case of taking an already known solution and applying it to what is actually not the relevant problem.
  2. Deep learning from CT images. Deep neural networks have achieved better accuracy than expert physicians in several medical imaging tasks in radiology, dermatology, ophthalmology, and pathology, so it is natural that several groups would try training them for diagnosing COVID-19. Again, a well-meaning effort, but sometimes not executed very well. For example, this paper uses CT images of COVID-19-confirmed patients from China as the positive class and images of healthy people from the United States as the negative class, so a classifier can separate the two by picking up differences between the hospitals and scanners rather than signs of the disease, which may introduce spurious correlations and artificially inflate the accuracy. Even if this task is well done, will it find its way into clinical practice? That has not yet been the case in the tasks mentioned above, despite the initial demonstrations having happened several years ago.
  3. Classifying breathing patterns. A paper posted to arXiv with the title Abnormal respiratory patterns classifier may contribute to large-scale screening of people infected with COVID-19 in an accurate and unobtrusive manner claims that “According to the latest clinical research, the respiratory pattern of COVID-19 is different from the respiratory patterns of flu and the common cold. One significant symptom that occurs in the COVID-19 is Tachypnea. People infected with COVID-19 have more rapid respiration.” The authors provide no reference to this clinical research, however, and I haven’t been able to track it down myself. If there isn’t really any distinguishing difference between respiration patterns with flu and COVID-19, then this work is in vain, and could have been avoided by conferring with clinicians.
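
For anyone who hasn’t met group testing before, here is a minimal sketch of the arithmetic behind the WWII-era idea (Dorfman’s two-stage pooling): test a pooled sample once, and retest individuals only if the pool comes back positive. It assumes a perfect test that still works on pooled samples and independent infections at a known prevalence, which are idealizations and exactly the kind of assumption that needs checking with the people who run the assays. The function names and the prevalence values are made up for illustration.

```python
def expected_tests_per_person(prevalence: float, pool_size: int) -> float:
    """Expected tests per person under two-stage (Dorfman) pooling.

    Idealized: assumes a perfect test and independent infections.
    """
    if pool_size == 1:
        return 1.0  # no pooling: one test each
    p_pool_negative = (1.0 - prevalence) ** pool_size
    # One pooled test shared by the group, plus one individual retest
    # per member whenever the pool is positive.
    return 1.0 / pool_size + (1.0 - p_pool_negative)


def best_pool_size(prevalence: float, max_pool_size: int = 50) -> int:
    """Pool size (1..max_pool_size) minimizing expected tests per person."""
    return min(
        range(1, max_pool_size + 1),
        key=lambda k: expected_tests_per_person(prevalence, k),
    )


if __name__ == "__main__":
    # Illustrative prevalence values, not estimates for any real population.
    for prevalence in (0.01, 0.05, 0.50):
        k = best_pool_size(prevalence)
        cost = expected_tests_per_person(prevalence, k)
        print(f"prevalence={prevalence:.2f} pool_size={k} tests/person={cost:.3f}")
```

The savings are large only when prevalence is low; once half the population is infected, pooling in this simple scheme costs more tests than it saves and the best "pool" is a single person.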

Management

  1. Spatiotemporal epidemiological modeling. Once an outbreak has started, it is important to model its spread to inform decision making in the response. This is the purview of epidemiology and has a lot of nuance to it. Small differences in the input can yield large differences in the output. This should be left to the experts who have been doing it for many years.
  2. Data-driven decision making. Another aspect of managing an outbreak is collecting primary (e.g. case counts), secondary (e.g. hospital beds, personal protective equipment), and tertiary (e.g. transportation and other infrastructure) information. This is highly challenging and in a disaster situation requires both formal and informal means. During the 2014 Ebola outbreak, we observed that there was a lot of enthusiasm for collecting, collating, and visualizing the case counts, but not so much for the secondary and tertiary information, which, according to the true experts, is really the most important for managing the situation. Case counts are getting most of the attention again now, but at least there is some focus on the secondary and tertiary information. Enthusiasm is great, but better when directed to the important problems.
  3. Engaging the public. In managing outbreaks, it is critical to inform the public of best practices to limit the person-to-person spread of the disease (which may go against cultural norms) and also to receive information about the situation on the ground. This has been done to good effect in the past, such as during the Ebola outbreak, and in certain places now, but seems to be lacking in many other places. Misinformation and disinformation on peer-to-peer and social network platforms appear to be rampant, but there seems to be little ‘tech solutioning’ in this space so far; perhaps the energy is being spent elsewhere.
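
To make the point about small input differences concrete, here is a toy sketch using the textbook SIR compartmental model, integrated with plain Euler steps. It is not one of the spatiotemporal models that epidemiologists actually run, and every parameter value below (population, infectious period, the reproduction numbers) is an illustrative assumption rather than an estimate for COVID-19.

```python
def sir_peak_infected(
    r0: float,
    infectious_days: float = 5.0,
    population: float = 1_000_000.0,
    initial_infected: float = 10.0,
    days: int = 365,
    steps_per_day: int = 10,
) -> float:
    """Peak number of simultaneously infected people in a toy SIR model."""
    gamma = 1.0 / infectious_days  # recovery rate per day
    beta = r0 * gamma              # transmission rate per day
    dt = 1.0 / steps_per_day
    susceptible = population - initial_infected
    infected = initial_infected
    peak = infected
    for _ in range(days * steps_per_day):
        new_infections = beta * susceptible * infected / population * dt
        recoveries = gamma * infected * dt
        susceptible -= new_infections
        infected += new_infections - recoveries
        peak = max(peak, infected)
    return peak


if __name__ == "__main__":
    # Hypothetical reproduction numbers, chosen only to show sensitivity.
    for r0 in (2.0, 2.5):
        print(f"R0={r0}: peak infected ~ {sir_peak_infected(r0):,.0f}")
```

Even in this stripped-down setting, moving the assumed reproduction number from 2.0 to 2.5 raises the peak load substantially, and a real model adds geography, age structure, behavior change, and reporting delays on top, which is why it is best left to the people who have been doing it for years.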

Cure

  1. Drug repurposing. Interestingly, drugs developed for particular diseases can also have a therapeutic effect on other diseases. For example, chloroquine, an old malaria drug, has an effect on certain cancers and anecdotally seems to show an effect on the novel coronavirus. By finding such old generic drugs whose safety has already been tested and which might be inexpensive and already in large supply, we can quickly start tamping down an outbreak after the therapeutic effect is confirmed in a large-scale clinical trial. But such findings of repurposing are difficult to notice at large scale without the use of natural language processing of scientific publications. A consortium recently released a collection of 29,000 scientific publications related to COVID-19 (CORD-19), but there is very little guidance for NLP researchers on what to do with that data and no subject matter expert support. Therefore, it seems unlikely that anything of much use will come out of it.
  2. Novel drug generation and discovery. Repurposing has its limits; we must also discover completely new drugs for new diseases. State-of-the-art generative modeling approaches have begun that journey, but are currently difficult to control. Moreover, subject matter experts must be consulted to figure out what desirable properties to control for in the generation: things like toxicity and solubility. Finally, generating sequences of candidate drugs in silico only makes sense if there is close coupling with laboratories that can actually synthesize and test the candidates.

In my originally envisioned post, I was going to end with a sort of cute twelfth item: staying at home. Apart from lumberjacks, data scientists are among the professions most suited to not spreading the coronavirus, according to this data presented by the New York Times. But in fact, this is not merely a cute conclusion: it is the one right contribution that data scientists can truly make well while in isolation off the bandwagon. When the fog clears, however, let’s be deliberate and work interdisciplinarily to create full, well-thought-out, and tested solutions for mitigating and managing global pandemics.