Trusting Data

March 13, 2012

Howdy Señor Evan R. Lawson, CFO of HankMed!  I was recently impressed by the elevators at the Windsor Court apartments in New York City because you punch in your destination floor on a panel in the hallway rather than just whether you want to go up or down.  (I’ve been wanting to see that somewhere, anywhere, for several years.  In a system with several elevators, the extra information can help reduce waiting time.)  I was also recently impressed by the tags on items at Macy’s department store because Macy’s has bar-coded down to the individual item; if we each buy exactly the same two-toned shoes (same size, same brand, same design, same everything), the two tags will have different identification numbers.  (Because of this unique identification, if I return an item, I don’t need the receipt, because my payment transaction information is recorded in the database with the item.)

Why do I mention these two things?  Mainly just because I wanted to, but also because they are related to the idea that when harnessed properly, more bits of data or information can lead to the smoother operation of life.

But data is tricky business, and statistics is one of the three kinds of lies.

One aspect of the SellerScope demo was its interactive nature.  One of the points John Patterson brought up while we put the interactive visualization together was that users have to have trust.  In the demo, we show predictions of which salespeople are at risk of voluntarily leaving the company.  If one interacts with the data and predictions, one can see that, at least in the fictional company whose data is loaded in the demo, salespeople tend to leave within the first 5 or 6 years of service with the company; after they’ve been around that long, they tend to stay the rest of their careers.  The beauty of the interaction is that if the user discovers that pattern himself or herself, then he or she is much more likely to trust it.
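To make that kind of discovery concrete, here is a minimal sketch of the tabulation a user might effectively be doing while poking at the data.  Everything in it is an assumption for illustration: the records are made up, the function name `departure_rate_by_tenure` is hypothetical, and the 6-year cutoff is simply borrowed from the pattern described above.  It is not how SellerScope computes its predictions.

```python
from collections import Counter

# Hypothetical records: (years_of_service, left_voluntarily).
# Made-up data for illustration only; not taken from the demo.
records = [
    (1, True), (2, True), (2, False), (3, True), (4, True),
    (5, True), (6, False), (8, False), (12, False), (20, False),
]


def departure_rate_by_tenure(records, cutoff=6):
    """Return the voluntary departure rate for two tenure bands.

    "early" covers people with at most `cutoff` years of service,
    "late" covers everyone with more than `cutoff` years.
    """
    counts = Counter()
    leavers = Counter()
    for years, left in records:
        band = "early" if years <= cutoff else "late"
        counts[band] += 1
        if left:
            leavers[band] += 1
    # Fraction of people in each band who left voluntarily.
    return {band: leavers[band] / counts[band] for band in counts}


rates = departure_rate_by_tenure(records)
print(rates)
```

With an interactive view, the user can slide the cutoff around and watch the two rates change, rather than being handed a single number to take on faith.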

As we’re both well aware, there’s more data being generated than ever before in human history – exponentially more.  But I don’t think that the trust in data is there yet, nor should it be.  I don’t think there can be trust unless there is transparency and the ability for people to interact with data.  In discussing a data error in the 2010 BCS calculation, Jerry Palm wrote:

The BCS has shaky enough credibility with John Q. Footballfan as it is. It needs to verify its data. Right now, nothing is verified. Nothing is accountable. Except Colley. Thank goodness we at least have that.

Also as discussed in this Nature Biotechnology article:

Systems biology aims to provide a mechanistic understanding of biological systems from high-throughput data. Besides its intrinsic scientific value, this understanding will accelerate product design and development, facilitate health policy decisions and may reduce the need for long-term clinical trials. For this to happen, the knowledge generated by systems biology has to become sufficiently trustworthy for the empirical approach underlying long-term clinical trials to be supplanted by an approach in which mechanism and mechanistic understanding is a driver for decisions. This raises fundamental questions of how to evaluate the veracity of predictions from systems biology models and how to construct mechanistic models that best reflect biological phenomena—questions that are of interest to both academia and industry.

One of the movements to give ‘John Q. Footballfan’ the ability to interact with data so that he may trust it is known as Open Data.  My former IBM Research officemate Marc Szeto-Millstone has recently joined a proponent and enabler of this movement, Socrata.  However, completely open data is sometimes not ideal either, because it diminishes the incentive, especially for companies, to invest in obtaining difficult-to-obtain data.  It is very challenging to balance the trust that goes with openness and the financial disincentive that goes with openness, wouldn’t you say?  The authors of the Nature Biotechnology paper have made one attempt at that balance, but I’m not sure that there aren’t other, better ways.

One thing I’m pretty sure about is that the recent trend to make the creation of static infographics easier will not solve the trust issue.  (You have posted images generated using Many Eyes and the Google Data Explorer on the blog previously; those platforms are really excellent for what they are intended to do.)  I think that even infographics made by skilled designers will only be trusted if the user can interact with the data and discover things without being explicitly told.  Hopefully within our lifetime, the kinds of lies will drop to two.



