How and Why I Independently Published A Book

May 9, 2022

Good afternoon Señor Horace Greeley. Many people have asked, so I’d like to recount how and why I independently wrote and published the book entitled Trustworthy Machine Learning, which is available for free in HTML and PDF formats at http://www.trustworthymachinelearning.com and as an at-cost paperback at various Amazon marketplaces around the world (USA, Canada, UK, Germany, Netherlands, Japan, …).

Why I Wrote A Book

Writing a book is a big effort and a big commitment, so why do it? Just as you shouldn’t found a startup just to be able to say you did a startup, you can’t write a book just because you want to have written a book. It has to be because you have something unique to say that the world needs to hear, and it is just bursting out of you.

I’d had a vague desire to write a book for a long time. But three years ago, I felt that there was something I needed to say: the approach and worldview for doing data science and machine learning that I had honed over a decade in an environment few others had experienced. And it felt like the deep learning revolution was missing some important things. I was ready to speak.

How It Started

In May 2019, I flew to Madrid to represent Darío at Fundación Innovación Bankinter’s Future Trends Forum. That trip was the only time in my life I’ve sat in business class, and it was fortuitous because it happened mere days after I had a painful back spasm. After the meeting concluded, I had a few hours to kill before proceeding onward to Geneva for the AI for Good Global Summit. Instead of risking my back with any tourism, I sat in a park (the thin green area on the map) and wrote down an entire outline for the book I was imagining. That outline ended up being close to that of the eventual finished product. Look below for exactly what I typed into the notes app of my phone that afternoon.

Introduction
 Age of Artificial intelligence 
    General purpose technology
 Trustworthiness
 Overview and Limitations 
   Overview 
   Limitations of book
   Biases of author
     Diverse voices

Preliminaries
 Uncertainty
   Aleatoric
   Epistemic
 Detection theory
   Confusion matrix
   Costs
   Bayesian detection
   ROC
   Calibration
   Robust (minimax) detection
   Neyman-Pearson detection
   Chernoff-Stein, mutual information theory, kl divergence
 Causality
 Directed graphical models

Data
 Finite samples
 Modalities
 Sources
   Administrative data
   Crowdsourcing
 Biases
   Temporal biases
   Cognitive biases/prejudice (quantization)
      Quantization only by words so don't have to introduce quantization and clustering
   Sampling biases
   Poisoning
 Privacy
   Causal basis included

Machine learning
 Risk minimization
 Decision stumps
    Trees, Forests
    Perceptron
    Margin-based methods
    Neural networks
 Adversarial methods
 Data augmentation
 Causal inference
 Causal discovery

Safety
 Epistemic uncertainty in machine learning
 Distribution shift
 Fairness
 Adversarial robustness
   (Causal foundations included in each pillar)
 Testing

Communication
 Explainability and interpretability
   Direct global
   Distillation / simple models
   Post hoc local
 Value alignment
   Unified theory
   Preference elicitation
   Specification gaming
 Factsheets
 Blockchain

Purpose
 Professional codes
 Lived experience
 Social good
   Types of problems with examples
 Open platforms 

Summer and Fall of 2019

Once I was back from Europe, the summer was upon us, which meant having our social good student fellows with us and their projects in full swing. That, along with my other work, also meant days full of meetings: a manager’s schedule rather than a maker’s schedule, so I didn’t do anything further on the book all summer. Here is my calendar on one of those summer days (and this wasn’t atypical).

In the fall of 2019, I had the honor of spending three months at IBM Research – Africa, in Nairobi, Kenya. Because of the time difference, I made myself available for meetings only from 8 am to 11 am Eastern, which often meant entire mornings (East Africa Time) with no meetings (except for the nice conversations with the Africa lab researchers). Even though I thought I could use that time to start writing the book, I didn’t. Instead, the sabbatical turned out to be a great time to recover and recharge (while also doing some stuff on maternal, newborn and child health). Recovery is underappreciated.

Starting to Write

Back home, and with my calendar still mostly bare, I blocked off 90 minutes for writing every day starting on January 2, 2020. I started getting into a flow and put some words and equations down on paper (really this Overleaf). I made good progress on an introduction chapter and a detection theory chapter.

Then in mid-February, Bob Sutor stopped by my office and said that an acquisitions editor for the publisher he worked with on Dancing with Qubits was looking to publish a book on responsible and ethical AI, and connected me with Tushar. Coincidentally, the same week, an acquisitions editor for Manning Publications emailed me cold about my possible interest in writing a book. I had good conversations with both editors and I was naïvely happy at the perfect confluence of events.

I filled out book proposals for both companies. Here is the one I did for Packt:

and here is the one I did for Manning:

I was completely honest in explaining what I wanted to do (mix of math and narrative), who it was for, and so on. I even sent over the couple of chapters I had already written. Both publishers were happy and accepted my proposal. Both made very similar offers in the contractual terms, which wasn’t particularly important for me because I wasn’t doing this for the money. Manning had an early access program through which readers could access chapters as they were being written (which is what I wanted and also why I had made the Overleaf open when I was writing the first two chapters), so I decided to go with them. I signed on the dotted line on March 17, 2020.

Turbulence

Things did not go as I thought they might. Everything had shut down a week earlier because of the Covid-19 pandemic, and the shutdown did not abate in any way. I was sitting on a dilapidated sofa in my basement trying to complete other work, taking the kids outside to kick a soccer ball around once in a while, and plotting out how to get scarce groceries — not exactly conducive to writing. Certainly no more 90 minute blocks of time daily.

More turbulent than that, however, was the publisher trying to shoehorn me into what they wanted. My proposal was very clear that the book would have a decent amount of math and no software code examples, would be a tour of different topics, and would be centered on concepts. But that didn’t seem to matter once things were underway. As I soon learned, Manning religiously follows Bloom’s taxonomy, and understanding concepts is very low on the totem pole. As instructed, I doggedly kept trying to push my text higher in the taxonomy, but it was mostly a farce to me, where I would just use the word “sketch” or “appraise” while still saying what I was going to say. I was also ruthlessly trying to reduce the math at their insistence. For example, the chapter on uncertainty as a concept morphed into evaluating safety.

There was a lot of back and forth, and a lot of frustration. Eventually, on February 16, 2021, the book was available for sale in the $40-$60 range through the early access program with the first four chapters available. We celebrated. I got a lot of positive feedback from people I know.

But the turbulence didn’t calm down. More Bloom, less math, and less of myself. I am not someone who uses the word “grok”. I didn’t want this to be a prescriptive recipe book because I don’t believe that that is what trustworthy machine learning is all about.

The book reached 320 sales by the time the first 12 chapters had been posted, which in my opinion is pretty darn good for something that is not even complete and with an underwhelming marketing effort.

Then came an ending and a rebirth. On September 10, 2021, the acquisitions editor reached out and said that the publisher would be ending the contract and the rights to the content would revert back to me. I guess the sales weren’t what they needed and the content continued to be mismatched from the desires of their typical buyers. This turn of events ended up being more of an emotional relief than anything else.

Did the book improve because of all that back and forth? On balance, I’d say yes. So no hard feelings.

Finishing

I am not one to leave things unfinished, and I wasn’t going to let the ending of the contract hold me back from completing the manuscript I had toiled over for so long. I vowed to finish the whole thing by the end of the calendar year. In less than four months, I wrote the remaining six chapters: an unbridled pace, much faster than anything I had managed before.

In September and October, I didn’t think much about the route to getting it out. Tushar reached out and offered to bring it to market through Packt, but I just wanted to focus on finishing it. And I did, on December 30!

By that time, I had made up my mind to post it online with a Creative Commons license to begin with. I created the website http://www.trustworthymachinelearning.com and posted a pdf of version 0.9. I quietly spread the word and kept getting a lot of positive response from acquaintances.

Independently Published

While a diverse panel I had assembled was giving version 0.9 a look over and providing feedback, I did a bunch of soul-searching on what this book was for and why I was doing it. I also pored over what people had written about self-publishing in today’s age. I clearly wasn’t in it for the money — I was more than happy for anyone in the world to learn from it without paying. In fact, empowering people, no matter their station in life, is one of the messages of the book. I wanted its message to ring far and wide.

While everyone has a little vanity in them, like I said at the beginning of this post, I hadn’t written the book just to have written a book. This was also not a book aiming for some kind of book award. I wasn’t going to be using it for an academic tenure or promotion case, or any other stamp of approval. I didn’t want IBM to be involved in any explicit way (Manning had actually sought that out through a sponsorship deal). I enjoy doing a little formatting and aesthetic stuff here and there, and copy-editing. The previous experience hadn’t shown me that a publisher would necessarily do the right kind of marketing. Kindle Direct Publishing is really easy, doesn’t require any capital investment, and has very wide reach.

Putting all of that thinking together, despite not having heard of others in my orbit doing it before, I decided to independently publish the book. It has been up on Amazon since February 16, 2022 at the lowest possible price that Amazon allows for covering their costs. I’ve been very happy with my decision. It suits me and my worldview.

Afterwards

That very day, February 16, I made a social media push about the book, and that very night, I received this very kind email from Michael Hassan Tarawalie:

Dear sir, 

It is an honor to come in contact with you, sir. Am a student at the electrical and electronic department, faculty of engineering Fourah Bay College, University of Sierra Leone.

Sir your book has helped me.

One of the very first citations to the book was in the influential report by NIST entitled “Towards a Standard for Identifying and Managing Bias in Artificial Intelligence”.

There have been several great reviews of the book on Amazon from people I don’t know. It has become almost a cottage industry for people to hold up their copy of the paperback in large meetings I attend on Zoom and for others to post photos holding their copy on social media.

As of today, 481 copies of the book have been printed and shipped across the world in less than 3 months. Even though I’m not tracking it, I’m sure lots of people have accessed the free pdf and used it to uplift themselves.

This is what I wished for.

It always seems impossible until it’s done.

Nelson Mandela

The Pandemic Bandwagon

March 21, 2020


A Wuhan-shake to you señor. Hope you’re doing alright with the shelter-in-place order for Santa Clara County and California. We’re on pause here in New York.

Yesterday morning, I was all psyched up to do a blog post and accompanying Twitter thread on 12 Data Science Problems in Mitigating and Managing Pandemics Like COVID-19, going through several issues related to the crisis to which data science (broadly construed) has some avenue to contribute. I’ve been glued to Twitter the last few evenings, and a lot of different people have been posting various things. I have things to share, I thought, so why not me?

A wise person asked me to reflect on whether it would be a sensible thing to do. She emphasized that “there are so many people who are jumping on the bandwagon trying to help. Some mean well while some are capitalizing on the situation. And of those that mean well, some are offering silly things.” As you’ve told me on occasion, Shannon was wary of the bandwagon as well, and much preferred the “slow tedious process of hypothesis and experimental validation.” He noted that “a few first rate research papers are preferable to a large number that are poorly conceived or half-finished.” What would he have said to streams of consciousness offered up 280 characters at a time? Adam Rogers wrote yesterday afternoon that “the chatter about a promising drug to fight Covid-19 started, as chatter often does (but science does not), on Twitter.”

I woke up this morning wishing for science, not chatter. I realized that I am not among “men ages 24–36 working in tech” predisposed to “armchair epidemiology.” I turned 37 a whopping five months ago!

Rogers continued: “Silicon Valley lionizes people who rush toward solutions and ignore problems; science is designed to find solutions by identifying those problems.”

So let’s talk about problems, and about how run-of-the-mill data scientists working in isolation, both literally and figuratively, usually lack the requisite problem understanding to make the right contribution.

In dealing with global disease outbreaks, such as the ongoing novel coronavirus pandemic, we can imagine four main opportunities to help: surveilling, testing, managing, and curing. We are primarily concerned with zoonotic diseases: diseases that transfer from animals to humans. By surveilling, we mean tools and techniques for predicting or providing early warnings of outbreaks of novel or known pathogens. By testing, we mean diagnosing individual patients with the disease. By managing, we mean tools and techniques for better understanding and limiting the spread of the outbreak, providing care, and engaging the citizenry. By curing, we mean the development of therapeutic agents to administer to infected individuals. In all of these areas, the lone data scientist working without true problem understanding can be misguided at best and detrimental at worst.

Surveilling

  1. Zoonotic pathogen prediction. There are a large number of known pathogens, but for most of them, it is not known whether they can transfer from animals into humans (and develop into outbreaks). It may be possible to predict the likely candidates by training on features of known zoonotic pathogens. We tried doing this a few years ago in partnership with disease ecologist Barbara Han, who defined the relevant features, but didn’t get very far because the features of pathogens are not available in a nice clean tabular dataset; they are locked up inside scientific publications. Automatically extracting knowledge from these very specialized documents requires a lot of expert-ecologist-annotated documents, which is not tenable. Even if we were able to pull together a dataset suitable for predicting zoonoses, we wouldn’t know how to make heads or tails of the results without the disease ecologists.
  2. Informed spillover surveillance. Once a pathogen is known as a zoonotic disease and has had an outbreak, it is important to monitor it for future outbreaks or spillovers. Reservoir species harbor pathogens without having symptoms and without dying, waiting for a vector to carry the disease to humans and start another outbreak. In the first year of the IBM Science for Social Good initiative, we partnered with the same disease ecologist to develop algorithms for predicting the reservoir species of primates for Zika virus in the Americas so that populations of those species could be further scrutinized and monitored. Without Barbara, we would have had no clue about what problem to solve, what data sources to trust, how to overcome severe class imbalance in the prediction task (by combining data from other viruses in the same family), and how the predictions could inform policy.
  3. Outbreak early warning. The earlier we know that an outbreak is starting, the earlier actions can be taken to contain it. There are often small signals in various reports and other data that indicate a disaster is beginning. BlueDot knew something was up with the novel coronavirus as early as December 30, 2019, but they’ve been at this for quite a while and have a team that includes veterinarians, doctors, and epidemiologists. Even then, their warnings were not heeded as strongly as they could have been.

Testing

  1. Group testing. There are shortages of COVID-19 tests in certain places. Well-meaning data scientists ask: isn’t there a smart way to test more people with the same number of tests? (I’ve seen it asked several times already, including in an email that a friend from grad school sent both of us.) Eventually, someone points out the method of group testing, which has been known since WWII. But even that is not the solution for the current method of testing (PCR). You pointed out in your response to the friend that group testing would require a serological test for COVID-19, which isn’t ready yet. A case of solving a problem with an already known solution that is actually not a relevant problem.
  2. Deep learning from CT images. Deep neural networks have achieved better accuracy than expert physicians in several medical imaging tasks in radiology, dermatology, ophthalmology, and pathology, so it is natural that several groups would try training them to diagnose COVID-19. Again, a well-meaning effort, but sometimes not executed very well. For example, this paper uses CT images of COVID-19-confirmed patients from China as the positive class and images of healthy people from the United States as the negative class, which may introduce spurious correlations and artificially inflate the accuracy. Even if this task is done well, will it find its way into clinical practice? That has not yet been the case for the tasks mentioned above, despite the initial demonstrations having happened several years ago.
  3. Classifying breathing patterns. A paper posted to arXiv with the title Abnormal respiratory patterns classifier may contribute to large-scale screening of people infected with COVID-19 in an accurate and unobtrusive manner claims that “According to the latest clinical research, the respiratory pattern of COVID-19 is different from the respiratory patterns of flu and the common cold. One significant symptom that occurs in the COVID-19 is Tachypnea. People infected with COVID-19 have more rapid respiration,” but the authors provide no reference to this clinical research and I haven’t been able to track it down myself. If there isn’t really any distinguishing difference between the respiration patterns of flu and COVID-19, then this work is in vain, and could have been avoided by conferring with clinicians.
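Setting aside which assay it would apply to, the arithmetic behind the group testing mentioned in item 1 can be sketched in a few lines. This is a minimal sketch of Dorfman’s classic two-stage pooling scheme; the function names and the prevalence value are purely illustrative:

```python
def expected_tests_per_person(p, k):
    """Dorfman two-stage pooling: one test per pool of k samples;
    if the pool is positive, every member is retested individually.
    Expected tests per person = 1/k + 1 - (1 - p)^k at prevalence p."""
    return 1.0 / k + 1.0 - (1.0 - p) ** k

def best_pool_size(p, max_k=100):
    """Pool size minimizing the expected number of tests per person."""
    return min(range(2, max_k + 1),
               key=lambda k: expected_tests_per_person(p, k))

# At 1% prevalence, pools of about 11 bring the expected number of
# tests per person down to roughly 0.2, a five-fold saving over
# testing everyone individually.
```

The savings shrink as prevalence rises, and beyond roughly 30% prevalence pooling no longer pays off at all, which is part of why the scheme is not a drop-in fix for every shortage.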

Managing

  1. Spatiotemporal epidemiological modeling. Once an outbreak has started, it is important to model its spread to inform decision making in the response. This is the purview of epidemiology and has a lot of nuance to it. Small differences in the input can yield large differences in the output. This should be left to the experts who have been doing it for many years.
  2. Data-driven decision making. Another aspect to managing an outbreak is collecting primary (e.g. case counts), secondary (e.g. hospital beds, personal protective equipment), and tertiary (e.g. transportation and other infrastructure) information. This is highly challenging and in a disaster situation requires both formal and informal means. During the 2014 Ebola outbreak, we observed that there was a lot of enthusiasm for collecting, collating, and visualizing the case counts, but not so much for the secondary and tertiary information, which, according to the true experts, is really the most important for managing the situation. The same focus on the former is true now, but at least there is some focus on the latter. Enthusiasm is great, but better when directed to the important problems.
  3. Engaging the public. In managing outbreaks, it is critical to inform the public of best practices to limit the person-to-person spread of the disease (which may go against cultural norms) and also to receive information from the situation on the ground. This has been done to good effect in the past, such as during the Ebola outbreak, and in certain places now, but seems to be lacking in many other places. Misinformation and disinformation in peer-to-peer and social network platforms appears to be rampant, but there seems to be little ‘tech solutioning’ in this space so far – perhaps the energy is being spent elsewhere.
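The sensitivity warned about in item 1 is easy to demonstrate with even a toy compartmental model. The sketch below uses the textbook SIR model with illustrative parameter values that are not calibrated to any real disease: nudging the transmission rate from 0.11 to 0.13 per day, with a fixed recovery rate of 0.1 per day, multiplies the peak infected fraction several times over.

```python
def sir_peak_infected(beta, gamma=0.1, i0=1e-4, days=3000, dt=0.1):
    """Forward-Euler integration of the classic SIR model on population
    fractions; returns the peak infected fraction over the epidemic."""
    s, i = 1.0 - i0, i0
    peak = i
    for _ in range(int(days / dt)):
        new_infections = beta * s * i * dt  # S -> I flow this step
        recoveries = gamma * i * dt         # I -> R flow this step
        s -= new_infections
        i += new_infections - recoveries
        peak = max(peak, i)
    return peak
```

With beta = 0.11 (basic reproduction number 1.1) the peak stays below half a percent of the population, while beta = 0.13 yields a peak several times higher. This is exactly why small errors in estimated inputs translate into large errors in projected outputs, and why the modeling is best left to epidemiologists.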

Curing

  1. Drug repurposing. Interestingly, drugs developed for particular diseases sometimes also have therapeutic effects on other diseases. For example, chloroquine, an old malaria drug, has an effect on certain cancers and anecdotally seems to show an effect on the novel coronavirus. By finding such old generic drugs whose safety has already been tested and which might be inexpensive and already in large supply, we can quickly start tamping down an outbreak after the therapeutic effect is confirmed in a large-scale clinical trial. But such findings of repurposing are difficult to notice at large scale without the use of natural language processing of scientific publications. A consortium recently released a collection of 29,000 scientific publications related to COVID-19 (CORD-19), but there is very little guidance for NLP researchers on what to do with that data and no subject matter expert support. Therefore, it seems unlikely that anything of much use will come out of it.
  2. Novel drug generation and discovery. Repurposing has its limits; we must also discover completely new drugs for new diseases. State-of-the-art generative modeling approaches have begun that journey, but are currently difficult to control. Moreover, consulting subject matter experts is required to figure out which desirable properties to control for in the generation: things like toxicity and solubility. Finally, generating sequences of candidate drugs in silico only makes sense if there is close coupling with laboratories that can actually synthesize and test the candidates.

In my originally envisioned post, I was going to end with a sort of cute twelfth item: staying at home. Apart from lumberjacks, data scientists are among the professions most suited to not spreading the coronavirus, according to this data presented by the New York Times. But in fact, this is not merely a cute conclusion: it is the one right contribution that data scientists can truly make well while in isolation, off the bandwagon. When the fog clears, however, let’s be deliberate and work across disciplines to create full, well-thought-out, and tested solutions for mitigating and managing global pandemics.


NAACL Stats

June 3, 2018

Comment ça se plume? The venerable Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL) reconvenes this week. The Great AI War of 2018 revisits New Orleans for another skirmish. 

Following my previous posts on AISTATS paper counts, ICASSP paper counts, ICLR paper counts, and SDM paper counts, below are the numbers for accepted NAACL papers among companies for long papers, short papers, industry papers, and all combined.

Company Paper Count (Long)
Microsoft 10
Amazon 5
Facebook 5
IBM 4
Tencent 4
DeepMind 3
Google 3
JD 3
NTT 3
Adobe 2
Elemental Cognition 2
PolyAI 2
Siemens 2
Agolo 1
Aylien 1
Bloomberg 1
Bytedance 1
Choosito 1
Data Cowboys 1
Educational Testing Service 1
Fuji Xerox 1
Grammarly 1
Huawei 1
Improva 1
Interactions 1
Intuit 1
Philips 1
Samsung 1
Snap 1
Synyi 1
Thomson Reuters 1
Tricorn (Beijing) Technology 1
Company Paper Count (Short)
IBM 4
Google 3
Microsoft 3
Facebook 2
Adobe 1
Alibaba 1
Amazon 1
Ant Financial Services 1
Bloomberg 1
Educational Testing Service 1
Infosys 1
NTT 1
PolyAI 1
Preferred Networks 1
Roam Analytics 1
Robert Bosch 1
Samsung 1
SDL 1
Tencent 1
Thomson Reuters 1
Volkswagen 1
Company Paper Count (Industry)
Amazon 6
eBay 4
IBM 2
Airbnb 1
Boeing 1
Clinc 1
Educational Testing Service 1
EMR.AI 1
Google 1
Interactions 1
Microsoft 1
Nuance 1
SDL 1
XING 1
ZEIT online 1
Company Paper Count (Total)
Microsoft 14
Amazon 12
IBM 10
Facebook 7
Google 7
Tencent 5
eBay 4
NTT 4
Adobe 3
DeepMind 3
Educational Testing Service 3
JD 3
PolyAI 3
Bloomberg 2
Elemental Cognition 2
Interactions 2
Samsung 2
SDL 2
Siemens 2
Thomson Reuters 2

My methodology was to click on all the pdfs in the proceedings and manually note affiliations.


SDM Stats

May 3, 2018

Hello! The venerable SIAM International Conference on Data Mining (SDM) reconvenes today for its eighteenth edition. The Great AI War of 2018 heads down the Pacific coast. 

Following my previous posts on AISTATS paper counts, ICASSP paper counts, and ICLR paper counts, below are the numbers for accepted SDM papers among companies.

Company Paper Count
IBM 5
Baidu 2
Samsung 2
Adobe 1
Facebook 1
Google 1
LinkedIn 1
NEC 1
NTUC Link 1
PPLive 1
Raytheon 1

My methodology is a manual scan of the printed program.


ICLR Stats

April 27, 2018

Hello bonjour! The venerable International Conference on Learning Representations (ICLR) reconvenes Monday for its sixth edition. The Great AI War of 2018 heads a little west. 

Following my previous posts on AISTATS paper counts broken down by institution and ICASSP paper counts broken down by company, below are the numbers for accepted ICLR main conference papers among the top companies.  Like in my ICASSP stats, Google and DeepMind are not treated separately.

Company Paper Count
Google 68
Microsoft 19
Facebook 14
IBM 8
Salesforce 6
Baidu 4
NVIDIA 4
Intel 3

My methodology this time relied on the data compiled by pajoarthur, including his logic for converting email addresses to institutions. However, I aggregated the numbers differently. He considered ‘Invite to Workshop Track’ status papers in his counts, whereas I did not. He evaluated the contribution of an author by dividing by the total number of authors of a paper and then summing these partial contributions by company; in contrast, as I did for AISTATS and ICASSP, I counted a paper for a company if it had at least one author from that company.
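The difference between the two aggregation conventions can be sketched with toy data (the papers and affiliations below are made up for illustration, not drawn from the actual ICLR counts):

```python
from collections import defaultdict

# Each paper is represented by the list of its authors' affiliations.
papers = [
    ["Google", "Google", "MIT"],
    ["Google", "Facebook"],
    ["MIT"],
]

# My convention: a paper counts once for each distinct company
# appearing among its authors.
whole_counts = defaultdict(int)
for affiliations in papers:
    for company in set(affiliations):
        whole_counts[company] += 1

# Fractional convention: each author contributes 1/n to their
# affiliation, where n is the number of authors on the paper.
fractional_counts = defaultdict(float)
for affiliations in papers:
    for company in affiliations:
        fractional_counts[company] += 1.0 / len(affiliations)

# Under whole counting, Google gets 2; under fractional counting,
# it gets 2/3 + 1/2, about 1.17.
```

Neither convention is wrong; they just answer slightly different questions, which is why totals from different posts are not directly comparable.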


ICASSP Stats

April 15, 2018

Annyeong! The venerable IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) reconvenes today for its forty-third edition. The Great AI War of 2018 rolls on. 

If you ask how signal processing has become AI, read my recent essay on the topic.

Following my previous post on AISTATS paper counts broken down by institution, I present the numbers for ICASSP below, but only for companies.  This time around, Google and DeepMind are not treated separately.

Company Paper Count
Google 38
NTT 26
Microsoft 25
IBM 16
Huawei 11
Amazon 9
Mitsubishi 9
Samsung 7
Facebook 6
Tencent 6
Alibaba 5
Apple 5
Ericsson 4
Robert Bosch 4
Starkey Hearing Technologies 4
Tata Consultancy Services 4
SRI International 3
Technicolor 3
Toshiba 3
Adobe 2
Analog Devices 2
GE 2
GN 2
Halliburton 2
Hitachi 2
Intel 2
NEC 2
Orange Labs 2
Origin Wireless 2
Qualcomm 2
Raytheon 2
Sony 2
Spotify 2
Thales 2
Toyota 2

I used the official paper index and did not dig deeper into the papers in any way.


AISTATS Stats

April 9, 2018

Hola! The venerable International Conference on Artificial Intelligence and Statistics (AISTATS) reconvenes today for its twenty-first edition. It has recently become common for there to be blog posts presenting the counts of papers at machine learning conferences broken down by institution.  I had not seen it for AISTATS 2018, so I went ahead and put the numbers together.

Institution Paper Count
MIT 13
UC Berkeley 12
Carnegie Mellon 11
Google 11
Stanford 11
IBM 9
Oxford 9
Princeton 9
INRIA 8
Texas 8
Duke 7
EPFL 7
Cornell 6
DeepMind 6
Harvard 6
Microsoft 6
Tokyo 6
ETH Zurich 5
Georgia Tech 5
Michigan 5
Purdue 5
RIKEN 5

Since The Great AI War of 2018 is apparently ongoing, here are the numbers for companies.

Institution Paper Count
Google 11
IBM 9
DeepMind 6
Microsoft 6
Adobe 3
Amazon 2
NTT 2
Baidu 1
Charles River Analytics 1
D. E. Shaw 1
Disney 1
Face++ 1
Facebook 1
Mind Foundry 1
NAVER LABS 1
NEC 1
Netflix 1
Prowler.io 1
SigOpt 1
Snap 1
Tencent 1
Vicarious 1
Volkswagen 1

In case one is tempted to add the Google and DeepMind numbers together, note that there is one paper in common, so the combined total is 16, not 17.
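Deduplicating a combined count is just inclusion-exclusion on sets: |G ∪ D| = |G| + |D| − |G ∩ D|. A tiny sketch with hypothetical paper identifiers (only the set sizes and the single shared paper mirror the counts above):

```python
# Hypothetical paper identifiers; "shared" is the one joint paper.
google = {f"g{i}" for i in range(10)} | {"shared"}    # 11 papers
deepmind = {f"d{i}" for i in range(5)} | {"shared"}   # 6 papers

combined = len(google | deepmind)  # set union deduplicates
assert combined == len(google) + len(deepmind) - len(google & deepmind)
# combined == 11 + 6 - 1 == 16
```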

Mapping affiliations to institutions is not an exact science, and it is exacerbated by the official accepted papers list here not containing the affiliations of many authors (and in some cases not even being the final list of authors), by there being many ways to refer to the same institution, and by ambiguity over what counts as a single institution versus multiple (especially difficult for me among French institutions). I have done my best to find preprints and look at personal websites to fill in and correct the institutions given in the accepted papers list. Here is the raw data file that I put together.


Circle of Life

March 18, 2017

Jambo Señor Bernard Lagat! Greetings from inside the Maasai Mara, where I got to see the whole cast of the Circle of Life. Before coming on my first trip to Africa, my knowledge of the continent, like that of nearly all Americans, was pretty much completely derived from The Lion King. As I was psyching myself up for the journey, I played a medley of songs from the film along with “Dry Your Tears, Afrika.”

However, once I got here, it was not Africa that was drying its tears, but me, as I explained to a member of the Clinton Health Access Initiative how the kindness of the American people over the last 70 years has led directly to our family’s and my being in the position we are in. Whether it was Sam Higginbottom and Mason Vaugh’s Allahabad Agricultural Institute that gave Baba his first job and encouraged his growth, the sponsors that allowed him to come to Illinois for higher studies with Bill Perkins not once but twice, the granters of the tuition waivers that allowed Papa to study there himself, the policymakers whose policies encouraged someone like him to work and gain lawful residence in the U.S., the agency program managers who funded Papa’s research, or the people behind the National Science Foundation Graduate Research Fellowship that allowed both of us to thrive in graduate school, the American people have consistently encouraged achieving the dream through hard work and the rewarding of skill, knowledge, and expertise regardless of caste or creed.

But it seems like we’re going through a “cultural revolution” that some attribute to the hollowing out of the middle class and rising income inequality caused by increased automation of jobs by technological solutions. I don’t think anyone should be promoting extreme income inequality, but solutions should come from the science and technologies themselves rather than from crippling the advances in science and technology that have gotten us to this point. As Stefano Ermon says, “It’s very important that we make sure that [artificial intelligence] is really for everybody’s benefit.” Last October I submitted a proposal for an artificial intelligence grand challenge for IBM Research to work on, ultimately not selected, on exactly this topic: reducing economic inequality. Given all that has transpired in the intervening months, my belief that such a project should be undertaken has only strengthened.

Here in Kenya, I had the pleasure of visiting the startups Soko and mSurvey, which are both doing their part in democratizing production and the flow of information.  Both have developed profitable technology-based solutions that happen to push back against inequalities in the developing world.  Back in March 2013, I had submitted a proposal, “Production by the Masses,” also not selected, for inclusion in IBM Research’s longer-term vision for the corporation; it has some of the elements that these two companies and others like them epitomize.  However, it also failed to fully anticipate some of the things that have taken hold recently, like the ridiculously high value of data and the power of the blockchain’s distributed ledger, and it over-emphasized the distinction between rural and urban populations.  I now see that the same sort of stuff is needed everywhere there are inequalities, which is everywhere.

Yes there is an ideal inclusive Circle of Life (fragile enough to be Scarred).  Let us all strive for that ideal by valuing knowledge and by using existing and new science and technology.


h1

Flying, flying, digging, digging

October 27, 2015

As you know, I’ve been travelling quite a bit the last month or so.  I think I may have put on more miles per unit time than ever before.  While flying around, I read a good number of popular books that I had been meaning to read, in the broad area of information science.  For example, I read Bursts by Laszlo Barabasi and learned more about Transylvanian history than I intended.  I also read Social Physics by my one-time collaborator Sandy Pentland, as well as The Life and Work of George Boole: A Prelude to the Digital Age by Desmond MacHale.  I had received this last book as a gift for giving one of the big talks at the When Boole Meets Shannon Workshop at University College Cork in early September.  An extensive biography, it also emphasizes how Boole’s The Laws of Thought makes a strong connection between logic and set theory on the one hand and probability theory on the other, a hundred years before Kolmogorov.  When Boole was reading extracts from the book-in-progress to his wife-to-be Mary Everest, [p. 148]:

She confessed that she felt comforted by the fact that the laws by which the human mind operates were governed by algebraic principles!

Incidentally, in 1868, Mary Boole also wrote a book, The Message of Psychic Science, which had the following rather prescient passage inspired by Babbage’s computer and Jevons’ syllogism evaluator [p. 267]:

Between them they have conclusively proved, by unanswerable logic of facts, that calculation and reasoning, like weaving and ploughing, are work, not for human souls, but for clever combinations of iron and wood.  If you spend time doing work that a machine could do faster than yourselves, it should only be for exercise, as you swing dumb-bells; or for amusement as you dig in your garden; or to soothe your nerves by its mechanicalness, as you take up knitting; not in any hope of so working your way to the truth.

Speaking of iron and wood, one last book I read in my travels is Why Information Grows by Cesar Hidalgo, and one of the first things he discusses is how solids are needed to store information.  As he says, close to my heart [p. 34]:

Schrödinger understood that aperiodicity was needed to store information, since a regular crystal would be unable to carry much information.

Let me list my travel venues for you:

And now with that travel done, I think I’ll be going hard on writing and maybe even some theorem-proving and data analytics.  As we’ve discussed, I find blogging sometimes jump-starts the writing/doing engines, and so here we go with cities.

As promised previously, I perform some formal tests for lognormal distributions of house sizes in Mohenjo Daro and in Syracuse.  As a starting point, I used the lognfit function in matlab to find the maximum likelihood estimates of the fit parameters and also the 95% confidence intervals.  The two parameters are the mean μ and standard deviation σ of the associated normal distribution.  The estimated value of σ is the square root of the unbiased estimate of the variance of the log of the data.  Rather than showing the rank-frequency plots as in the previous post, let me show the cumulative distribution functions.  Note that in the Syracuse data, about 1/5 of houses do not have a listed living area, so I exclude them from this analysis.
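For readers without matlab, the point estimates that lognfit computes are a few lines of Python (a minimal sketch of the two parameter estimates only, without the confidence intervals, and with synthetic areas standing in for the real data):

```python
import numpy as np

def lognfit_params(data):
    """Point estimates matching what lognfit reports: mu is the mean of
    log(data); sigma is the square root of the unbiased (n-1 denominator)
    estimate of the variance of log(data)."""
    logs = np.log(np.asarray(data, dtype=float))
    return logs.mean(), logs.std(ddof=1)

# synthetic stand-in for the house-area data
rng = np.random.default_rng(0)
areas = rng.lognormal(mean=4.0, sigma=0.5, size=1000)
mu_hat, sigma_hat = lognfit_params(areas)
```

With a thousand samples the estimates land close to the true (4.0, 0.5); the fitted CDF is then just the normal CDF evaluated at the log of the data.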

[Figure: empirical and fitted lognormal CDFs of house surface areas, Mohenjo Daro]

[Figure: empirical and fitted lognormal CDFs of house living areas, Syracuse]

At least visually, these don’t look like the best of fits.  To measure the goodness of fit, I use the chi-square goodness-of-fit test as implemented in matlab as chi2gof.  With data 'area' already fit using lognfit into parameter vector 'parmhat', this is [h,p] = chi2gof(area,'cdf',@(z)logncdf(z,parmhat(1),parmhat(2)),'nparams',2).  Despite the visual evidence, the chi-square test does not reject the null hypothesis of lognormality at the 5% significance level for Mohenjo Daro.  The chi-square test does reject the null hypothesis of lognormality at the 5% significance level for Syracuse, contrary to the theory of Bettencourt, et al.  I wonder what the explanation might be for this contrary finding in Syracuse: maybe some data fidelity issues?
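The same test can be sketched in Python without the statistics toolbox.  This is not the chi2gof implementation itself, just an equiprobable-binning version of the same idea, with the two estimated parameters removed from the degrees of freedom:

```python
import numpy as np
from scipy import stats

def chi2_gof_lognormal(data, n_bins=10):
    """Chi-square goodness-of-fit against a fitted lognormal.
    Bins are chosen equiprobable under the fitted distribution, and
    ddof=2 accounts for the two estimated parameters (mu, sigma)."""
    data = np.asarray(data, dtype=float)
    logs = np.log(data)
    mu, sigma = logs.mean(), logs.std(ddof=1)
    # interior bin edges at fitted quantiles 1/n_bins, ..., (n_bins-1)/n_bins
    interior = np.exp(mu + sigma * stats.norm.ppf(np.arange(1, n_bins) / n_bins))
    observed = np.bincount(np.searchsorted(interior, data), minlength=n_bins)
    expected = np.full(n_bins, len(data) / n_bins)
    return stats.chisquare(observed, expected, ddof=2)

# synthetic stand-in for the house-area data
rng = np.random.default_rng(0)
chi2, p = chi2_gof_lognormal(rng.lognormal(4.0, 0.5, size=2000))
```

Rejecting when p falls below 0.05 reproduces the 5%-level decision rule used above.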

By the way, I also promised some other nuggets and so here is one: the relationship between living area and value in Syracuse.

[Figure: assessed full value vs. living area, Syracuse]

There is certainly more than just living area that determines value.  In fact, the methodology of assessing house value is an interesting one.  One more nugget is on when houses that existed in Syracuse in July 2011 were built.

[Figure: histogram of year built for houses existing in Syracuse in July 2011]

I wonder if there is a way to understand this data through a birth-death process model.  There is a nice theoretical paper in this general direction, “Random Fluctuations in the Age-Distribution of a Population Whose Development is Controlled by the Simple “Birth-and-Death” Process,” by David G. Kendall from the J. Royal Statistical Society in 1950.
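As a toy version of that idea, here is a small Python simulation of a simple birth-and-death process (the rates and horizon are made-up numbers, not fit to the Syracuse data): each of 20 founders gives “birth” at rate lam and dies at rate mu, and we record the ages of those still standing at time T, the analogue of a year-built histogram.

```python
import numpy as np

# Hypothetical per-individual rates: births at rate lam, deaths at rate mu
rng = np.random.default_rng(7)
lam, mu, T = 1.0, 0.5, 8.0

birth_times = [0.0] * 20   # 20 founders at time 0
alive_ages = []
i = 0
while i < len(birth_times):
    t = birth_times[i]
    death = t + rng.exponential(1.0 / mu)   # exponential lifetime
    # offspring arrive as a Poisson process while this individual lives
    s = t
    while True:
        s += rng.exponential(1.0 / lam)
        if s >= min(death, T):
            break
        birth_times.append(s)
    if death > T:                  # still "standing" at observation time
        alive_ages.append(T - t)
    i += 1
# In a growing population (lam > mu) most survivors are young, which is
# the qualitative shape one expects in a year-built histogram.
```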

To close the story with more birth and death: unlike studying Mohenjo Daro, on which there is little information, the difficulty for future historians will certainly be too much data.  Before the travels, I finished reading through the popular book Dataclysm by the author of the OKCupid blog; in a sense, it is an expanded version of that blog.  One of the big things it points out is that there will be growing longitudinal data about individuals due to social media such as Facebook.  Collections of pennants are eventually taken down from bedroom walls, but nothing is taken down from Facebook walls.  The book also makes use of culturomics.

As I may have foreshadowed, Ron Kline’s book, The Cybernetic Moment (that I helped with a little bit), also uses culturomics a little bit to measure the nature of discourse.

So that is some flying, flying, and digging, digging from me.  Hope you’ll contribute to the discourse so future historians have more to study.  By the way, the city sizes for the various places (as per Wikipedia today) are, from large to small:

  • 13,216,221 – Tokyo
  • 2,722,389 – Chicago
  • 435,413 – Jeju
  • 84,513 – Champaign
  • 67,947 – Santa Fe
  • 41,250 – Urbana
  • 12,019 – Los Alamos
  • 5,138 – Monticello

Perhaps data for a statistical assessment?

h1

Digging into a city

March 26, 2015

Glad to see your work described previously now appearing in a journal paper, but also glad to know that it is doing social good.  One of the main things you did was look for buildings from satellite imagery, which is really quite a neat thing.  As you know, I have been quite intrigued by the science of cities, and perhaps data from satellite imagery can be useful to make empirical statements there.  Can one see municipal waste remotely?  In anticipation of that, perhaps I can dig through some data on cities that I happen to have, and see if there are interesting statements to be made regarding scaling laws within cities (in contrast to most work that has focused on scaling laws among cities, though I should note the work of Batty, et al.).  As examples, I will consider recent data from our hometown of Syracuse, NY and also data from Mohenjo Daro of the Indus Valley civilization.  

As you can guess, the Syracuse data was gathered from my service on an IBM Smarter Cities Challenge team, by digging through some old servers held by a not-for-profit partner of the City of Syracuse.  The journal paper on that is finally out, but more importantly it seems to be having some social impact.  Here is a newer video on impacts of what we did there.

The data on Mohenjo Daro is from actual digging, rather than digging through computers.  Built around 2600 BCE, Mohenjo Daro was one of the largest settlements of the ancient Indus Valley Civilization and one of the world’s earliest major urban settlements.  Mohenjo Daro was abandoned in the 19th century BCE, and was not rediscovered until 1922.  The data I will use was initially mapped by British archaeologists in the 1930s in their excavation of Mohenjo Daro, and collected in the paper [Anna Sarcina, “A Statistical Assessment of House Patterns at Moenjo Daro,” Mesopotamia Torino, vol. 13-14, pp. 155-199, 1978.].

Before getting to the data, though, let me describe some theoretical work on the distribution of house sizes from a recent paper of Bettencourt, et al. in a new open access journal from AAAS.  From the settlement scaling theory developed, they make a prediction on the distribution of house areas.  In particular, the overall distribution should be approximately lognormal.  This prediction is borne out in archaeological data of houses in pre-Hispanic cities in Mexico.  The basic argument for why the lognormal distribution should arise is from a multiplicative generative process and the central limit theorem.  A reference therein attributes the argument back to William Shockley in studying the productivity of scientists, but according to Mitzenmacher it goes back even further.  (Service times in call centers also appear to be approximately lognormal, among other phenomena.)
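The multiplicative argument is easy to see numerically.  A quick Python sketch (purely synthetic growth factors, not fit to anything): multiply many independent positive shocks and check that the log of the product behaves like a normal.

```python
import numpy as np

# Each "house" starts at unit size and experiences 50 independent
# multiplicative growth shocks; by the central limit theorem, log(size)
# is a sum of 50 iid terms and so is approximately normal, which makes
# the size itself approximately lognormal.
rng = np.random.default_rng(42)
n_houses, n_steps = 5000, 50
factors = rng.uniform(0.9, 1.2, size=(n_houses, n_steps))
sizes = factors.prod(axis=1)
log_sizes = np.log(sizes)

# sample skewness of the logs; near 0 for a normal distribution
skewness = ((log_sizes - log_sizes.mean()) ** 3).mean() / log_sizes.std() ** 3
```

The skewness of log_sizes comes out near zero even though the sizes themselves are strongly right-skewed, which is exactly the lognormal signature.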

Anyway, coming to our data, let me first show the rank-frequency plot of the surface area (m²) of 183 houses in Mohenjo Daro.

[Figure: rank-frequency plot of house surface areas, Mohenjo Daro]

Now I show the rank-frequency plot of the living area (ft²) of 41,804 houses in Syracuse (data from July 2011).

[Figure: rank-frequency plot of house living areas, Syracuse]

What do you think?  Does it look approximately lognormal?  I’ll soon write another blog post with some formal statistical analysis, and some other nuggets from these data sets.
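In case you want to make such plots for your own data, the rank-frequency computation is tiny.  A Python sketch (the original workflow was matlab; here synthetic lognormal areas stand in for the real data sets):

```python
import numpy as np

def rank_size(values):
    """Rank-frequency (rank-size) view of positive data: returns ranks
    1..n and the values sorted from largest to smallest, typically
    plotted on log-log axes, e.g. plt.loglog(ranks, sizes)."""
    v = np.sort(np.asarray(values, dtype=float))[::-1]
    return np.arange(1, len(v) + 1), v

# synthetic stand-in for the 183 Mohenjo Daro house areas
rng = np.random.default_rng(1)
ranks, sizes = rank_size(rng.lognormal(4.0, 0.5, size=183))
```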

Incidentally, as requested, I seem to be making creativity a part of my research agenda (from an information theory and statistical signal processing perspective).  I spoke about fundamental limits to creativity at the ITA Workshop in San Diego in February (though the talk itself ended up being slightly different than the abstract).  I also organized a special session on computational creativity, which was fun.  

I think someone should connect creativity and cities in some precise informational way, and perhaps you are the man to do it.