You’ll have noticed there is something of a backlash against the tech giants these days. In the wake of the scandal over Cambridge Analytica’s alleged unauthorised use of personal data about millions of Facebook users, Mark Zuckerberg was subjected to an intensive grilling by the US Senate. (In the event, the questions were about as lacerating as candy floss, because it turns out that most Senators have a pretty modest understanding of how social media works.) More seriously, perhaps, the share prices of the tech giants have tumbled in recent weeks, although Mr Zuckerberg’s day in Congress raised Facebook’s share price sufficiently to increase its CEO’s personal wealth by around $3bn. Not a bad day’s pay.
The tech firms are in the dog house for a number of reasons, but one of the most pressing is the widespread perception that they are using – and enabling others to use – algorithms which are poorly understood and which cause harm.
The word “algorithm” comes from the name of a ninth-century Persian mathematician, Al-Khwarizmi, and is surprisingly hard to explain. It means a set of rules or instructions for a computer to follow – the recipe for solving a problem, rather than the precise, step-by-step code of the program that implements it. A machine learning algorithm uses an initial data set to build an internal model which it uses to make predictions; it then tests these predictions against additional data and uses the results to refine the model.
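To make that build-test-refine loop concrete, here is a minimal sketch in Python. It uses the scikit-learn library purely as an illustration, and the data is synthetic; the variable names and the 1,000-example size are assumptions, not anything drawn from a real system.

```python
# A minimal sketch of the build-test-refine loop described above.
# Assumes scikit-learn and NumPy are installed; the data is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical "initial data set": 1,000 examples with two features each,
# and a simple rule generating the yes/no outcome we want to predict.
X = rng.normal(size=(1000, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Hold some data back so predictions are tested on examples
# the model has never seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Build the internal model from the initial data...
model = LogisticRegression().fit(X_train, y_train)

# ...then test its predictions against the additional data.
print("accuracy on unseen data:", model.score(X_test, y_test))
# If accuracy is poor, the usual remedy is more (and more representative)
# data, or a better model - the "refine" step in the loop.
```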
If the initial data set is not representative of the population (of people, for instance) about which it will be making decisions, then those decisions can be prejudiced and harmful. When asked to provide pictures of hands, algorithms trained on partial data sets have returned only pictures of white hands. In 2015, a Google algorithm reviewing photographs labelled pictures of black people as gorillas.
When this kind of error creeps into decisions about who should get a loan, or who should be sent to jail, the consequences can obviously be serious. It is not enough to say (although it is true) that the humans being replaced by these systems are often woefully prejudiced. We have to do better.
Algorithms’ answers to our questions are only as good as the data we give them to work with. To find the best needles you need really big haystacks. And not just big: you need them to be diverse and representative too. Machine learning researchers are very well aware of the danger of GIGO – garbage in, garbage out – and all sorts of efforts and initiatives are under way at the tech giants and elsewhere to create bigger and better datasets.
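A crude way to check whether a training set is representative is simply to compare how often each group appears in it with how often that group appears in the population the system will serve. The sketch below does this in Python; the group labels, counts and population shares are entirely hypothetical.

```python
# A toy representativeness check for a training set.
# Group labels, counts and population shares are hypothetical.
from collections import Counter

training_labels = ["group_a"] * 900 + ["group_b"] * 100  # what we actually collected
population_share = {"group_a": 0.5, "group_b": 0.5}      # what the real population looks like

counts = Counter(training_labels)
total = sum(counts.values())

for group, target in population_share.items():
    actual = counts.get(group, 0) / total
    flag = "UNDER-REPRESENTED" if actual < 0.5 * target else "ok"
    print(f"{group}: {actual:.0%} of training data vs {target:.0%} of population -> {flag}")
```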
Society has to walk a tightrope regarding the use of data. Machine learning and other AI techniques already provide great products and services: Google Search provides us with something like omniscience, and intelligent maps tell us how long our journeys will take at different times of day, and divert us if an accident causes a blockage. In the future they will provide many more wonders, like the self-driving cars which will finally end the holocaust taking place continuously on our roads, killing 1.2 million people each year and maiming another 50 million or so.
The data to fuel these marvels is generated by the fast-growing number of sensors we place everywhere, and by the fact that more and more of our lives are lived digitally, leaving breadcrumb trails wherever we go. It would be a tragedy if the occasionally hysterical backlash against the tech giants we are seeing today ended up throttling the availability of the data needed to let machine learning systems weave their magic – and to do so with less bias than we humans harbour.
Left to our own devices, most of us would likely carry on using Facebook and similar services with the same blithe disregard for our own privacy as we always have. But the decisions about how our data is used are not ours alone to make. Quite rightly, governments and regulators will make their opinions known. The European Union’s GDPR (General Data Protection Regulation), which comes into force in May, is a powerful example. It requires anyone who stores or processes the personal data (liberally defined) of people in the EU to do so only when there is a proper basis for it, and to give the subjects of that data access to it on demand.
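As a purely illustrative sketch of the “access on demand” idea, the Python below gathers every record held about one person from a set of in-memory stores. The store names, field names and identifiers are hypothetical, and real compliance with the GDPR involves far more than a lookup like this.

```python
# A toy illustration of giving a data subject access to the data held on them.
# The stores, fields and identifiers here are hypothetical.
from typing import Any

# Hypothetical in-memory stores standing in for real databases.
DATA_STORES: dict[str, list[dict[str, Any]]] = {
    "orders":   [{"user_id": "u42", "item": "book", "price": 12.99}],
    "logins":   [{"user_id": "u42", "when": "2018-04-01T09:30:00Z"}],
    "profiles": [{"user_id": "u99", "email": "someone@example.com"}],
}

def subject_access_request(user_id: str) -> dict[str, list[dict[str, Any]]]:
    """Return every record held about one data subject, grouped by store."""
    return {
        store: [record for record in records if record.get("user_id") == user_id]
        for store, records in DATA_STORES.items()
    }

print(subject_access_request("u42"))
```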
This is all well and good, but as we call for “something to be done” to curb the power of the tech giants, let’s bear a few things in mind. First, regulation is a blunt weapon: regulators, like generals, often fight the war that has just finished, and as technology accelerates, this effect will become more pronounced. Second, regulation frequently benefits incumbents by raising barriers to entry, and by doing so curbs innovation. In general, regulation should be used sparingly, and only to combat harms which are either proven or virtually inevitable.
Third, the issue is who controls or has access to the data, not who owns it. “Ownership” of a non-rivalrous, weightless good like data is a nebulous idea. But access and control are important. More regulation generally means that the state gains more access and control over our data, and we should think carefully before we rush in that direction. China’s terrifying Social Credit system shows us where that road can lead: the state knows everything about you, and gives you a score according to how you behave on a range of metrics (including who your friends are and what they say on social media). That score determines whether you have access to a wide range of benefits and privileges – or whether you are punished.
As our data-fuelled future unfolds, there will be plenty more developments which oblige us to discuss all this. At the moment the data we generate mostly concerns places we go, people we talk to, and things we buy. In future it will get more and more personal as well as more and more detailed. Increasingly we’ll be generating – and looking to control the use of – data about our bodies and our minds. Then the debates about privacy, transparency and bias will get really interesting.