Be just or be data-driven?

There was some Amazon-bashing going on at Hacker News lately and one of the commenters came up with a story about a guy whose business used Amazon for sales and advertising. One day Amazon permanently revoked his account, no explanation given. The comment describes the consequences: "This person lost everything they built over four years of hard work and had to file for bankruptcy when the commitments made by the business to support a multi-million-dollar-per-year supply pipeline caught-up with what had happened."

This particular account may or may not be true, but if you read the tech press, every now and then you come across a similar horror story. A big internet company that acts as a gatekeeper to a specific commercial or social sphere screws something up and some little person bears the consequences.

The interesting part is that it's never an act of bad will or an attempt to harm someone. It's just that the company's software has erred, and at the scale of millions of users there's no way (or at least no cheap and scalable way) to deal personally with each individual user, answer the complaints or even check what has actually happened.

You may have various opinions on the topic. Many people would reason that if the service you are relying on is free, the company has no obligation to provide it to you. They can stop at any time they wish.

Later on that day, however, I listened to a talk by Tim O'Reilly. Among other things he mentioned that the bail system in the US is broken in that it keeps the rich out of pre-trial incarceration while keeping the poor in. He went on to explain that they've built a data model that predicts who is safe to let out and who is not. That way the money is kept out of the process.

Now, that may sound weird. So you are going to rot in jail while your buddy is released just because some software said so? But you can still make an argument for the system: First, even if it performs poorly, it's still better than bail. Second, you can think of it as a lottery. The state is entitled to keep you in pre-trial incarceration. If they let you go, it's a privilege, not an obligation. And such privileges can be administered by means of a lottery. Denmark, after all, used to depend on a lottery for military conscription and I never heard any complaints.

To take the argument further, let's assume that the software, both the one revoking accounts at Amazon and the one used by the local judge, is based on machine learning. Specifically, neural networks.

A neural network is a beast that looks like it has escaped from a gedankenexperiment: It takes inputs, generates some more or less reasonable outputs, but nobody has any idea of what's going on in the middle. In our case the inmate's data go in and the machine says either 'yes' or 'no'. Trying to find out why it has decided for either option is a futile exercise.

So far we are good. The inscrutability of the process plays nicely with the perception of the whole thing as a lottery.

However, imagine someone collects the outputs from the machine and after a while realises that most of the people ordered to stay in jail are black. The white people, on the other hand, are mostly released. The damned box is a racist!
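To make the "collecting the outputs" step concrete, here is a minimal sketch of the kind of tally an outside observer could do. Everything in it is an assumption made for illustration: the function name release_rates, the group labels and the numbers are all made up and do not come from any real system or dataset.

```python
from collections import Counter


def release_rates(decisions):
    """Return the share of released people per group.

    `decisions` is an iterable of (group, was_released) pairs, i.e. only
    what an outside observer can see: who went in and what the box said.
    """
    total = Counter()
    released = Counter()
    for group, was_released in decisions:
        total[group] += 1
        if was_released:
            released[group] += 1
    return {group: released[group] / total[group] for group in total}


# Hypothetical observations, invented purely for this example.
observed = (
    [("black", False)] * 7 + [("black", True)] * 3
    + [("white", False)] * 2 + [("white", True)] * 8
)

rates = release_rates(observed)
for group, rate in sorted(rates.items()):
    print(f"{group}: {rate:.0%} released")
```

Note that a tally like this only shows that the outcomes differ between the groups. It says nothing about why the box decided the way it did.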

But the designers of the neural network certainly haven't put any prejudice against black people into the system! They've just assembled some neuron-like components into a mesh and that's it. Also, we have no way to find out what the neural network "thinks". There's no way to say whether it's driven by the colour of the skin or by some other, unknown, factors.

What now?

Are we going to give up on data-driven methodology? If so, the decisions we make will likely be much worse.

Or are we going to give up on justice, crushing random small people under the wheels of the data-driven approach?

And whatever your answer is, keep in mind that it applies both to the jail case and the Amazon case.

August 19th, 2015
