Sharing in the Age of Platforms and Machine Learning: Public Data vs. Training Data & Why We Should Know the Difference

Jan 14, 2019

Over the weekend, a meme was trending on Facebook asking folks to share two photos of themselves. The prompt was to show “how aging has hit you…” and it caught my attention.

What I posted in response also took off and has now spread. Here’s what I said:

Hey gang, this “post two photos of yourself” thing feels like a way to gather a data set to train a machine learning classifier for facial recognition to be more accurate in ID’ing people based on an old photo.
As a machine learning researcher, let me just say that paired photos associated with the same name, with a positive ID and a known time interval, are the gold standard data set for doing this.
I know of similar projects that have used photos available in the public record for this kind of classifier training (celebrity photos, yes, but also mug shots) because they share the data characteristics I mentioned above.
The risk would not be to you individually. This is more of a herd immunity issue. The better these things get at IDing us from any particular photo, the closer we are to the world of John Anderton in Minority Report.

A Conversation About Public Data, Privacy, and Risk

As folks shared and responded, the conversation turned into what I saw as a chance to talk about what it means to share your data with folks who are not only interested in you, personally. I pulled some of those issues from the discussion thread into the post in an update:

What is the Point of a Campaign Like This, if That’s What It Is?

I’ve seen a few folks note that these photos are out there anyway…it’s true. But it would take some human hands and work to aggregate, clean, and label the data. That hand-labeled data is what “supervised” machine learning depends on, and preparing it is expensive, tedious, and time-consuming. The goal of a campaign like this would be to crowdsource that effort in order to boost the chances of “unsupervised” machine learning. Essentially, this creates a very high signal-to-noise ratio in an extant data set.

Q: “Would it rely on the hashtag?”

That would help to ID an instance for inclusion. A hashtag is a nice loud signal for a human or a bot. But it wouldn’t be strictly necessary, because the bot could also look for structural characteristics that are signaled in the markup (and which most users wouldn’t think much about). For an inclusion pass, we’d rank these indicators, with the hashtag carrying the most weight and the structural data carrying less. Each instance would get a score that told us how confident our classifier is that the instance belongs in the data set. We’d set a minimum threshold and throw out anything below that.
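To make that inclusion pass concrete, here is a minimal sketch in Python. Everything specific in it is an assumption for illustration: the hashtag, the two structural cues (exactly two images, two different years in the text), the point weights, and the threshold are all hypothetical, and nothing here describes an actual campaign or bot.

```python
import re
from dataclasses import dataclass

HASHTAG = "#examplechallenge"  # hypothetical hashtag, not the real meme's

# Hypothetical weights in points out of 10: the hashtag counts most,
# the structural cues count less.
WEIGHTS = {"hashtag": 5, "two_images": 3, "two_years": 2}
THRESHOLD = 5  # assumed minimum score for inclusion


@dataclass
class Post:
    text: str
    image_count: int


def inclusion_score(post: Post) -> int:
    """Score how likely a post belongs in the data set (0 to 10)."""
    # Distinct four-digit years mentioned in the text, e.g. "2009 vs 2019".
    years = set(re.findall(r"\b(?:19|20)\d{2}\b", post.text))
    signals = {
        "hashtag": HASHTAG in post.text.lower(),
        "two_images": post.image_count == 2,
        "two_years": len(years) >= 2,
    }
    return sum(WEIGHTS[name] for name, present in signals.items() if present)


def select(posts: list[Post]) -> list[Post]:
    """Keep only the posts that meet the minimum threshold."""
    return [p for p in posts if inclusion_score(p) >= THRESHOLD]


tagged = Post("Me in 2009 vs 2019 #ExampleChallenge", image_count=2)
untagged = Post("2009 and 2019, how time flies", image_count=2)
unrelated = Post("Lunch photo from 2019", image_count=1)
kept = select([tagged, untagged, unrelated])
```

With these made-up weights, the untagged post still clears the threshold on structure alone, which is the point about the hashtag helping without being strictly necessary.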

Yes, but Why? Is This a Big Conspiracy?

I know folks are antsy about everything being a vast conspiracy. But this is not that. It could just be about making money…

A word about motive, since I’m seeing some discussion of that. My point on that would be that we don’t need any kind of ideological motive to understand why someone might mount a campaign like this. Gathering this sort of data set using supervised learning (see comment above this one) would be very costly. But if you can dramatically reduce the cost and produce a high signal-to-noise facial recognition data set like this, you could sell it to all sorts of folks. Good training data is hard to come by. Downstream there might be folks with nefarious motives. But prior to that, it just looks like a business plan.

And it could be, for those working on computer vision and facial recognition, about advancing science…a colleague reminded me of that. It came up later, too, as the conversation turned to a longer-term issue: how we build awareness of the way data we share on platforms today can be used.

Anticipating How Your Data Will Be Re-Used in Training Machine-Learning Classifiers: A New Digital Literacy Concern

As the post spread throughout the day, I tried to clarify my position. My post wasn’t a call of alarm, but a heads up…so I posted a new motto:

New motto: Keep training data messy and expensive!

Yes, I was trying to be funny. I wanted to convey that I know that opting out of one meme doesn’t change the world we live in. I also started to notice that folks were missing a key distinction that I was, perhaps, assuming they knew…so I posted a clarification:

On Training Data vs. Public Data

A takeaway for those of you preparing digital literacy, digital rhetorics, and critical DH courses from seeing my post yesterday spread: a key distinction many folks do not make is between public data and human-verified training data.
Public data is everywhere and there are risks to putting yours out there. No question about it.
Training data, on the other hand, is prepared specifically to have high concentrations of a few specific features in a sample that accurately represents a specific population (or a “universe,” as it is sometimes called in machine learning parlance when the sample is drawn from something other than people, such as a text corpus).
Machine learning classifiers need high-quality training data to make sense of public data.
What I see folks saying is that because public data exists, there is no additional risk in sharing a meme that looks, to my eye, to have the specs for a training set. I agree there is no additional risk to the individual. But individuals should be aware that they could be contributing to a project (building a valuable asset) that they may not want to be part of.
Facial recognition is already quite advanced. But there are a few weak spots. This data set looks to be aiming to address one of those.
So you could help to do that if you want to. And you should know that it’s not the same as when you swipe your red card at Target or do any of a thousand other things that gather data about you and that you have to assess the risks of.

Even folks who know a lot about how technology works can miss this key distinction. I had this exchange with someone with coding experience:

Coder (paraphrasing): It’s not that hard to do facial recognition now. I’ve done machine learning before…I’ve built programs that do it for a class, in fact, using this library [links to a Stack Overflow thread].
Me: So have I. I’m not saying it’s not out there. I’m saying this campaign makes it less expensive. Public data is one thing. Data that accurately samples your universe with adequate signal strength on your key parameters to train is another, wouldn’t you agree?
Also Me: If you want to talk more specifics…one statistical blind spot in facial recognition comes from using eigenvectors to extract features (a.k.a. eigenfaces). This turns faces into vectors. A known weakness of this approach is extracting the facial features that change with age, which could be used to match the same person over time…because there aren’t enough distinctions to train for accuracy. The result is good recall but poor precision (i.e., lots of false positives).
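For readers who don’t work with these terms every day, here is a small Python sketch of what “good recall but poor precision” means. The labels are invented for illustration and do not come from any real face matcher: ten photo pairs, three of which really show the same person, and a hypothetical matcher that says “same person” for six of them.

```python
def precision_recall(predicted, actual):
    """Compute (precision, recall) from parallel lists of booleans.

    True means "these two photos show the same person".
    """
    pairs = list(zip(predicted, actual))
    true_pos = sum(1 for p, a in pairs if p and a)
    false_pos = sum(1 for p, a in pairs if p and not a)
    false_neg = sum(1 for p, a in pairs if a and not p)
    # Precision: of the matches the classifier claimed, how many were right?
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    # Recall: of the real matches, how many did the classifier find?
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Made-up example data: 3 real matches among 10 photo pairs.
actual = [True, True, True, False, False, False, False, False, False, False]
# The hypothetical matcher finds all 3, but also flags 3 pairs that don't match.
predicted = [True, True, True, True, True, True, False, False, False, False]

precision, recall = precision_recall(predicted, actual)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case the matcher finds every real match (recall of 1.0) but only half of its claimed matches are correct (precision of 0.5), which is the lots-of-false-positives pattern described in the exchange.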

And there is the issue I alluded to above regarding “advancing science.” This meme, whoever conceived it, could easily just be part of a plan someone has to solve a thorny technical problem. The library the coder used had a training set — whether he realized that or not — with some known weaknesses. Crowdsourcing high-quality training data to address a known challenge in facial recognition would be a reasonable first step to take if that’s your frame of reference.

But this is where folks discounting the risk (resistance is futile!) and those looking to advance Science (capital S) are perhaps both being short-sighted. The question we should ask is: how will better facial recognition be used? And do we want to contribute to the efforts to advance this technology without having a discussion about those uses?

At a minimum, I don’t think we want to contribute without knowing to what, in fact, we are contributing.