Archive for November, 2022


Three Levels of AI Auditing

November 29, 2022

The phrase hook, line, and sinker usually refers to fooling or deceiving someone, but I don’t think it has to. It can also just mean convincing someone thoroughly. The question of what precisely an AI audit should be, especially auditing for equity, is receiving greater attention. In my mind, there are three levels of AI auditing, just like the hook, line, and sinker. The hook is a journalistic approach: an individual attention-grabbing, persuasive narrative example that is easy to grasp, but may or may not reveal a systemic problem. The second level is an outside-in study of a system that may reveal a pattern, and it might not even be exactly related to the first-level hook. The third level requires access to the internals of the system and can be quite detailed. The decision maker doesn’t even have to be AI; human decision making can be audited in the same ways.

Let’s look at a few examples to get a better sense of what I mean.

Gender Shades

The first-level audit in Joy Buolamwini’s Gender Shades work was her use of a white mask to show that a face tracking algorithm didn’t work well on dark-skinned faces.

Her second-level audit was creating the small Pilot Parliaments Benchmark dataset and using it to report intersectional differences in the gender classification task of commercial face attribute classification APIs.

She didn’t have access to the internals, but with access to some of the embedding spaces used by the models, we did a third-level analysis. We found that skin type and hair length are unlikely to be contributing factors to disparities, but it is more likely that there is some mismatch between how dark-skinned female celebrities present themselves and how dark-skinned female non-celebrities and politicians do, especially in terms of cosmetics.

Apple Card

The discovery of gender bias in the Apple Card began with a single example reported in a tweet, and it hooked a lot of people.

Enough so that the New York State Department of Financial Services launched a detailed third-level investigation, eventually exonerating Apple Card and Goldman Sachs, the financial firm that backed it.


ProPublica understood these levels as it disseminated the findings of its famous study on the COMPAS algorithm for predicting criminal recidivism. The main article included both the hook of individual stories, like those of Bernard Parker and Dylan Fugett, and a statistical analysis of a large dataset from Broward County, Florida. More detailed first-level and second-level articles were published alongside it.

Northpointe (now Equivant), the maker of COMPAS, did its own analysis to refute ProPublica’s analysis on the same data, so still second-level. (The argument hinged on different definitions of fairness.) The Wisconsin Supreme Court ruled that COMPAS can continue to be used, but under guardrails. I don’t think there has ever been a third-level analysis that breaks open the proprietary nature of the algorithm.

Asylee Hearings

Reuters did the same thing as ProPublica in a story about human (not AI) judgments in asylum cases. A first-level part of the story focuses on two women, Sandra Gutierrez and Ana, who have very similar stories of seeking asylum, yet one was granted asylum and the other was not, by different judges. A second-level part of the story focuses on the broader pattern across many judges and a large dataset.

Given that all of the data is public, Raman et al. did a third-level in-depth study on the same issue. They found that partisanship and individual variability among judges have a strong role to play in the decisions, without even considering the merits of a case. This is a new study. It will be interesting to track what happens because of it.


There are many other examples of first-level audits (e.g. an illustration of an object classifier labeling black people as gorillas, a super-resolution algorithm making Barack Obama white, differences in media depictions of Meghan Markle and Kate Middleton, language translations from English to Turkish to English showing gender bias, and gender bias in images generated from text prompts). They sometimes lead to second- and third-level audits (e.g. image cropping algorithms that prioritize white people and disparities in internet speed), but often they do not.

So What?

Each of the three levels of audits has a role to play in raising awareness and drawing attention, hypothesizing a pattern, and proving it. To the best of my knowledge, no one has laid out these different levels in this way, but it is important to make the distinction because they lead to different goals, different kinds of analysis, different parties and access involved, and so on. As the field of AI auditing gets more standardized and entrenched, we need to be much more precise in what we’re doing — and only then will we achieve the change we want to see, hook, line, and sinker.


There is No Generic Algorithmic Fairness Problem

November 28, 2022

Hello Lav. I know you’re a proponent of block diagrams and other similar abstractions because they permit a kind of elegance and closure that helps us make progress as scientists. I’m all for it too, except when we need to take that progress all the way down to applications that have very contextual nuances. For example, I’d be very happy if these kinds of diagrams of algorithmic fairness and bias mitigation (drawn by Brian d’Alessandro et al. and Sam Hoffman, respectively) were all we needed, but let me talk through how context matters. (This will be a discussion different from choosing the most appropriate fairness metric, which has its own nuance, and from asking whether fairness should even be viewed as a quantitative endeavor.) This was part of a presentation I made for NIST in August.

Let’s go through a series of 7 real-world examples.

The first is clinical prediction of postpartum depression and its fairness across race. This one is almost as generic as you can get because the protected attribute is clear, it is a typical machine learning lifecycle, and so on. But the nuance in this application is the importance of focusing only on the riskiest patients (top decile). Someone who does not know what they are doing might just classify patients as above or below the mean or median risk.
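To make the thresholding point concrete, here is a minimal sketch with made-up risk scores: flagging only the top decile selects far fewer patients than a naive split at the mean, and those are the ones who matter in this application.

```python
import numpy as np

def flag_highest_risk(risk_scores, top_fraction=0.1):
    """Flag only the riskiest patients (top decile by default)
    rather than splitting at the mean or median."""
    scores = np.asarray(risk_scores, dtype=float)
    cutoff = np.quantile(scores, 1.0 - top_fraction)
    return scores >= cutoff

# hypothetical risk scores for ten patients
scores = np.array([0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.95])
print(flag_highest_risk(scores).sum())   # patients flagged by the top-decile rule
print((scores > scores.mean()).sum())    # many more flagged by a naive mean split
```

On this toy data the top-decile rule flags a single patient while the mean split flags half of them; the fairness properties of the two rules can be entirely different.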

The second is skin disease diagnosis from medical images. Here the nuance is that although the protected attribute is somewhat clear (skin type), it is not given with the dataset. The images have to be segmented so that only healthy skin is used to estimate a skin color, which is then grouped according to individual typology angle (ITA). The ITA and its groupings are themselves not without controversy.
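For concreteness, here is a minimal sketch of the ITA computation, assuming CIELAB values have already been estimated from segmented healthy skin; the grouping thresholds below are ones commonly cited in the literature, but, as noted, they vary across papers and are debated.

```python
import math

def ita_degrees(L_star, b_star):
    """Individual typology angle: ITA = arctan((L* - 50) / b*) * 180/pi,
    from CIELAB lightness L* and yellow-blue component b*.
    atan2 avoids a division-by-zero when b* is near zero."""
    return math.degrees(math.atan2(L_star - 50.0, b_star))

def ita_group(ita):
    """One commonly cited set of ITA groupings; exact thresholds
    differ across papers."""
    if ita > 55:
        return "very light"
    if ita > 41:
        return "light"
    if ita > 28:
        return "intermediate"
    if ita > 10:
        return "tan"
    if ita > -30:
        return "brown"
    return "dark"

# hypothetical CIELAB estimates from two segmented skin patches
print(ita_group(ita_degrees(70.0, 15.0)))  # higher L* -> lighter group
print(ita_group(ita_degrees(40.0, 25.0)))  # lower L* -> darker group
```

The controversy comes in at both steps: the estimated L* and b* depend on illumination and segmentation quality, and the thresholds carve a continuum into categories.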

The third is legal financial obligations (fines and fees) in Jefferson County, Alabama. Here the fairness does not have to do with machine learning or artificial intelligence-based predictors, but with analyzing human decision-making to discover and finely characterize bias issues. All the analysis happens before the d’Alessandro and Hoffman block diagrams.

The fourth is the Ad Council’s “It’s Up to You” campaign for Covid-19 vaccine awareness, and wanting all people to receive the messaging equally effectively. In this targeted-advertising application, the labels are so highly imbalanced that a typical classifier would just always predict one of the classes, which also doesn’t really allow for bias mitigation. Here, the class imbalance has to be dealt with before algorithmic fairness can be.
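One standard first step, sketched below, is inverse-frequency reweighting (the same heuristic as the ‘balanced’ class weights in common libraries), so the minority class is not ignored before any fairness intervention is applied. The 95/5 split is made up for illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency class weights, n_samples / (n_classes * count):
    computed before any fairness step, so a classifier cannot win
    by always predicting the majority class."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

labels = [0] * 95 + [1] * 5           # hypothetical 95/5 imbalance
weights = balanced_class_weights(labels)
print(weights)                         # the minority class is weighted ~19x more
```

Only once the classifier is actually forced to distinguish the classes does measuring or mitigating group disparities become meaningful.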

The fifth is “bust or boom” prediction in ESPN’s fantasy football site, where the predictions for individual players have unwanted bias with respect to their team membership. The nuance here is that the fairness component (AI Fairness 360 in this case) is just a tiny part of the overall, highly complicated system architecture, and cannot just be thrown in haphazardly.

The sixth is predicting the need for preventative health care management while using health cost or utilization as a proxy, and the racial bias this yields against Black patients in the United States. The problem became quite well known due to a study by Obermeyer et al. However, being more nuanced and splitting the proxy into different categories of health care cost (in-patient, out-patient, emergency) yields much less racial bias, whereas a typical data engineering step is to add all health costs together into a single variable.

The seventh is child mortality prediction in sub-Saharan Africa, where there may be bias in prediction quality across countries. The nuance here is that the problem also has significant concept drift over time, so bias is not the only issue. If drift and bias are not disentangled, the results will not be meaningful.

Studying a genericized fairness problem is a good thing to do from a pedagogical perspective, but it is only the starting point for working on a real-world problem in its full context.


Consent, Algorithmic Disgorgement and Machine Unlearning

November 28, 2022

Good afternoon Señor Greg Rutkowski. I’ve been giving some talks recently that are not associated with published work. I think it might be useful to have some of that content posted online, so here we go. This one is from a presentation I gave to the Future of Privacy Forum in September.

There is growing interest in the public policy world about algorithmic disgorgement: the required destruction of machine learning models that were trained on data they weren’t supposed to be trained on. This interest stems from the Federal Trade Commission’s March order requiring Weight Watchers to destroy a model trained on data collected, without consent, from children using a healthy eating app.

To understand the concept and its implications, there are three relevant facts:
1. training machine learning models can take weeks;
2. machine learning models contain imprints of their training data points, which can be extracted using clever techniques; and
3. deleting the training data after a model has already been trained does not remove those imprints.

I had not heard the term algorithmic disgorgement before I was asked to do the presentation. As I was doing my research, I came across the article “Algorithmic Destruction” by Tiffany C. Li. One of the nice quotes in that article is “What must be deleted is the siloed nature of scholarship and policymaking on matters of artificial intelligence.” To this point, even though I was speaking to policymakers, I did not want to shy away from a little bit of relevant math to make sure they understand what is really going on. It would be a disservice to them otherwise.

The derivative f′(x) of a function tells you its slope. For multi-dimensional functions, the slope is known as the gradient ∇f(x).

If you want to get down from the summit of Mount Everest the fastest you can, you always want to keep going down the steepest part that you can. That is known as gradient descent.

Machine learning models are mathematical functions and many machine learning models are trained using a version of gradient descent.

Specifically, an algorithm takes a labeled training data set {(x₁,y₁), (x₂,y₂), (x₃,y₃), …, (xₙ,yₙ)} and produces a model f by performing gradient descent on a loss function L(f) = Σᵢ ℓ(f(xᵢ), yᵢ) that measures how poorly the model fits the training data. The labeled training data set may be historical loan approval decisions about real people made by loan officers. When you’re taking small step after small step walking down Mount Everest, you can imagine that it takes a long time. Similarly, despite advances in algorithms and hardware accelerators, it can take weeks to train a large model.
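To make the Mount Everest analogy concrete, here is a minimal sketch of gradient descent on a toy squared loss; the data, learning rate, and step count are made up for illustration.

```python
import numpy as np

def train_by_gradient_descent(X, y, lr=0.1, steps=500):
    """Minimize the squared loss L(w) = (1/n) * sum_i (x_i . w - y_i)^2
    by repeatedly taking a small step down the steepest slope."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        grad = (2.0 / n) * X.T @ (X @ w - y)  # the gradient of L at w
        w -= lr * grad                        # one small step downhill
    return w

# toy data generated by y = 2x; gradient descent recovers the slope
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
w = train_by_gradient_descent(X, y)
print(w)  # close to [2.0]
```

Five hundred tiny steps to fit one parameter on three points; now imagine billions of parameters and billions of points, and the weeks-long training runs become unsurprising.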

Different algorithms have different loss functions, which means they have different Mount Everests underneath and yield different models. One kind of algorithm is a neural network. Depending on the algorithm used to train them, models can have a small or a large imprint of the training data. Some models are really jagged and bend around individual data points; it is in these models that the imprint is large.

Through sophisticated methods known as model inversion attacks, it is possible to get a good idea of what the training data points were, just from the model. And how do model inversion attacks work? Why, of course, by gradient descent! They’re able to figure out a training data point by taking small steps toward what the model is confident about.
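Here is a stylized sketch of the idea (not any particular published attack): if we can compute gradients of the model’s confidence, we can climb that surface from a rough guess and land on an input the model is most sure about. The ‘model’ below is a toy whose confidence peaks at a hypothetical training point.

```python
import numpy as np

def invert_by_gradient_ascent(grad_confidence, x0, lr=0.5, steps=200):
    """Take small steps that increase the model's confidence,
    moving the guess toward an input the model is most sure about."""
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x += lr * grad_confidence(x)  # ascend the confidence surface
    return x

# toy 'model' whose confidence peaks at a hypothetical training point t
t = np.array([3.0, -1.0])
confidence = lambda x: np.exp(-np.sum((x - t) ** 2))
grad_confidence = lambda x: -2.0 * (x - t) * confidence(x)

recovered = invert_by_gradient_ascent(grad_confidence, x0=[2.5, -0.5])
print(recovered)  # close to t: the 'training point' has been reconstructed
```

The same loop that trains a model by walking down its loss can reconstruct its training data by walking up its confidence.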

Once we have a trained model, we’re in a ‘fruit of the poisonous tree’ situation. The training data is the tree and the model is the fruit. Cutting down the tree — deleting the training data — does not help us remove the poison from the fruit — exclude the imprint of tainted training data points.

All these facts seem to imply that if a data point should not have been in the training set, then to guarantee its information cannot be retrieved, we must re-train the model from scratch without the data point in question, which may be computationally unreasonable.

But not so fast my friends! There is a new category of technical approaches known as machine unlearning that can come to the rescue. They are ways to get a new model equivalent to one trained from scratch without an objectionable data point, but without having to do nearly as much computation. There are two main approaches:
1. being smart about structuring the training process (like a Chinese wall) so that only a small piece of the model has to be retrained, and even then, only from very close to the bottom of Mount Everest; and
2. using gradients!

In the gradient-based approach, you can figure out the influence of specific (tainted) training data points by tracing back how they created the underlying Mount Everest or loss landscape, and zeroing out their influence without having to retrain the model at all. (We have a paper that Prasanna is presenting tomorrow at NeurIPS that does something similar, but for training data points that lead to unfairness, rather than ones that have a consent issue.)
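As a minimal sketch of the gradient-based idea (in the spirit of influence-function and certified-removal methods, not our NeurIPS paper’s exact method): for ridge regression the loss landscape is quadratic, so a single Newton step that adds back the removed point’s gradient, preconditioned by the inverse Hessian of the remaining loss, exactly matches retraining from scratch. The data below is synthetic.

```python
import numpy as np

def newton_unlearn(X, y, w, idx, lam=1e-3):
    """One-step unlearning of training point idx: add back that point's
    gradient, preconditioned by the inverse Hessian of the remaining
    ridge loss sum_i (x_i . w - y_i)^2 + lam * ||w||^2."""
    n, d = X.shape
    xi, yi = X[idx], y[idx]
    keep = np.ones(n, dtype=bool)
    keep[idx] = False
    Xk = X[keep]
    H = 2.0 * Xk.T @ Xk + 2.0 * lam * np.eye(d)  # Hessian without the point
    g = 2.0 * xi * (xi @ w - yi)                 # removed point's loss gradient
    return w + np.linalg.solve(H, g)

# ridge regression: train on all points, unlearn one, compare to retraining
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=20)
lam = 1e-3
ridge = lambda A, b: np.linalg.solve(A.T @ A + lam * np.eye(3), A.T @ b)

w_full = ridge(X, y)
w_unlearned = newton_unlearn(X, y, w_full, idx=7, lam=lam)
w_retrained = ridge(np.delete(X, 7, axis=0), np.delete(y, 7, axis=0))
print(np.allclose(w_unlearned, w_retrained))  # True: no retraining needed
```

For non-quadratic losses (neural networks), the same update is only an approximation of retraining, which is exactly where the certified-removal research agenda lives.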

To the best of my knowledge, for all the policy discussion happening around algorithmic disgorgement, no one had connected the problem to the technical solution of machine unlearning. Similarly, in the burgeoning technical research direction of machine unlearning, no paper I am aware of uses the specific policy issue or terminology as a starting point. We need more people to heed Tiffany C. Li’s call for breaking the silos between policymaking and AI scholarship. It might also be a good idea to have some standardized traceable protocols for how objectionable data points are tagged and then have their influence removed from models.


Special Place

November 6, 2022

Good day sir. In my previous post, I referred to a couple of recent papers by Chetty et al. that study economic mobility and connectedness in the context of social capital. One of the ‘al.’ in the author list of those papers is Monica Bhole, who was a little younger than us growing up in the Indian community in Syracuse. This community was a special place (and not because it is the birthplace of the Indian-American speller who initiated a long chain of Indian-American winners of the national spelling bee and birthplace of the first Indian-American Miss America). The nurturing we had in this diaspora community and its effect are not captured in the Chetty et al. kind of studies because they miss out on a milieu that is not simply pairwise friendship relations and because they are limited to income as a success variable. Pioneers thrown into the deep end of a foreign country and culture, and having only each other to rely upon is a unique experience. I think you’ve found some of this unique experience in your cohort of White House Fellows, and I gleaned that this has been a positive for you.

I kind of spoke of this in late August at the memorial service of Mrs. Ashutosh at Drumlins:

This is an event I am sad to attend because of the loss that it represents. But it is also an event I am happy to attend because it gives us all a chance to reunite as an extended family. This diasporic extended family that gave us all the encouragement to become who we were supposed to become, authentically. Poetry, drama, argument, criticism, rhythm, melody, roots, knowledge of the world, and even silent reflection have always been around us because of the tone that auntie set.

But I just as easily could have been speaking about any of the women we have had in our lives, including Mamma, who we lost in May. I was talking with Sunita about this later at Drumlins: the trope of an inauthentic, gossipy, Indian auntie drawn to conspicuous consumption is far removed from our experiences. We have known strong, honest, conscientious, hard-working, interesting, beautiful aunties. We need to tell these stories and share the experiences that shaped us and our values. Journaling and blogging are known to be therapeutic, and I will start doing some more of it now.