In case you weren’t paranoid enough, this one will help: research conducted at the Massachusetts Institute of Technology (MIT) has shown that just four vague pieces of information can be enough to reveal most people’s identity and credit card details.
In a data set recording more than one million users’ credit card transactions over three months, the researchers were able to identify 91 per cent of people and their credit card information from either the details of what was bought, and where, in four credit card purchases, or from one receipt, one Instagram photo of a meal out with friends and one tweet about a recent purchase.
With this information, they could extract your identity and credit card records 94 per cent of the time.
So how did they do it? Any purchases made with the same credit card were ‘tagged’ with a randomly generated identification number. This was how they identified each customer in the overall data set.
The researchers then picked out purchases at random and determined how many other customers’ purchase histories shared identical data points.
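The test described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the researchers' code: the toy data, the `(shop, day)` format and the function names `matching_users` and `unicity` are all made up for the example.

```python
import random

# Toy data: pseudonymous ID -> set of (shop, day) purchase points.
# Every value here is invented for illustration.
TRANSACTIONS = {
    "u1": {("bakery", 1), ("cinema", 2), ("grocer", 3), ("cafe", 4)},
    "u2": {("bakery", 1), ("cinema", 2), ("grocer", 5), ("gym", 6)},
    "u3": {("bakery", 1), ("bookshop", 2), ("grocer", 3), ("cafe", 7)},
    "u4": {("pharmacy", 1), ("cinema", 2), ("grocer", 3), ("cafe", 4)},
}


def matching_users(known_points, transactions):
    """Return the IDs whose purchase history contains every known point."""
    known = set(known_points)
    return [uid for uid, points in transactions.items() if known <= points]


def unicity(transactions, n_points, trials=1000, seed=0):
    """Estimate the share of random n-point samples that match exactly one user."""
    rng = random.Random(seed)
    users = sorted(transactions)
    unique = 0
    for _ in range(trials):
        uid = rng.choice(users)
        # Pick n_points purchases at random from this user's history.
        sample = rng.sample(sorted(transactions[uid]), n_points)
        if len(matching_users(sample, transactions)) == 1:
            unique += 1
    return unique / trials
```

In this toy set, knowing only that someone visited the bakery on day 1 leaves three candidates, while knowing all four of a person's purchases leaves exactly one. The random ID hides the name, but it does not stop the purchase pattern itself from acting as a fingerprint.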
Experiments
In numerous different experiments the researchers varied the number of data points per customer from two to five.
Even without any price information, just two data points were still enough to recognise more than 40 per cent of the people in the data set. At the other end, five points with price information were enough to identify pretty much everyone.
“The question we started with was: if a data set is anonymous, is it possible to re-identify individuals within it?” says Yves-Alexandre de Montjoye, an MIT graduate student and co-author of the research.
“What do I need to know about a specific person to be able to identify them? Well, if we know they bought four items in four different shops at four different times, that is enough to identify them from a huge data set.
“What we showed to be possible is really the extent to which you can take a data set that seemed to be anonymous, and identify individuals from very limited information. We didn’t need names or phone numbers or addresses.”
Previous studies had made similar findings based on mobile phone data. The MIT research team wanted to see if the findings could be replicated with other kinds of metadata, such as credit card details or transportation data.
“In other words, is this specific to mobility or mobile phone data?” says de Montjoye. “Or is all metadata at risk? Well the answer is: ‘Yes it is’.”
Data breaches
We already know the digital grid is connected in ways difficult to fathom, and that those who want to make trouble online can. It just seems so very easy.
“Our online activity can be linked together because there are common sets of data across various media, from cashless transactions to social media and image capture,” says Brian Bohan of security consultancy firm BH Consulting.
“You may have your mobile phone info connected to your credit card details, and then you may be tweeting from your phone, which has location services turned on by default.
“In some cases a picture that you tweet or update onto Facebook tells where you updated from.”
Countless examples of online data privacy infringement have been reported for over a decade now. Perhaps the most notorious was in 2012, when US supermarket Target unintentionally revealed a teenager’s secret pregnancy to her parents by sending her maternity coupons (based on analysis of her retail habits at the store).
“There will always be examples like this,” says Prof Alan Smeaton, director of the Insight Centre for Data Analytics.
“What the Target example and this credit card example and countless others show is that by cross-checking and correlating data from different sources, we can stitch facts together in ways that were not foreseen and thus find out things by the piecing together of the disparate sets of facts, like the teenager being pregnant, for example.”
We might immediately conclude that only online criminals and the like would actually partake in such nasty digital behaviour.
“The truth is this is the basis for a growing amount of online business,” says Bohan. “Our digital footprint cannot be underestimated and it grows ever dirtier by the minute.
“Why do you think Facebook wanted to buy WhatsApp so much? For the phone numbers. It was a key piece of info they were lacking about their users which they can now link to many different systems.”
Intrusive
Prof Smeaton says: “Even universities use their own data on student admissions. A student’s CAO results can be combined with their catchment area, along with census information, travel distances using Maps and public transport information.”
Is this type of intrusive online behaviour easy to carry out?
“Well ‘easy’ is relative,” says Bohan. “My brother is great with car engines and could change an oil filter in five minutes. I wouldn’t know where to start. So a certain amount of specialist knowledge is still required.”
There are, of course, some data protection laws in place.
“They’re not enforced to their full extent though,” says Bohan. “Plus you may be leaving your digital footprint in multiple jurisdictions. If you’re an Irish user surfing a website in China, which laws apply?”
So, even with years of highlighted online privacy infringements and plenty of legislation allegedly protecting us, the same rules apply. It is up to individuals to keep themselves safe online.
“If you leave your wallet on O’Connell St and someone picks it up, it’s the criminal’s fault for taking it. You shouldn’t have left it on the street though,” says Bohan.
“Likewise, if someone abuses your online info they’re the criminal, but there is a lot you can do to protect yourself. The more educated you are and aware of the threats, the more secure you can be, either from criminals or corporations trying to sell your info.”
Solutions
De Montjoye and his colleagues are currently working on a scalable framework intended to make such data breaches much harder to carry out.
“It’s called SafeAnswers,” he says. “This approach doesn’t try to anonymise and then share data. The data stays where it is and instead it allows third parties to ask specific questions of the data and obtain specific answers: ‘How much time did it take you to get from home to work this morning?’, for example.
“It will only share the answer and not any information about the location or stops made along the way.”
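The question-and-answer idea de Montjoye describes can be illustrated with a short Python sketch. It is not the SafeAnswers software or its API: the `PersonalDataStore` class, the `commute_minutes` method, the `LocationPing` record and the sample data are all hypothetical, written only to show raw data staying put while a single answer leaves.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class LocationPing:
    time: datetime
    place: str  # e.g. "home", "work", or a stop in between


class PersonalDataStore:
    """Hypothetical store: raw pings stay inside, only fixed questions are answered."""

    def __init__(self, pings):
        self._pings = sorted(pings, key=lambda p: p.time)

    def commute_minutes(self, origin="home", destination="work"):
        """Minutes between the last ping at origin and the first later ping at destination."""
        depart = None
        for ping in self._pings:
            if ping.place == origin:
                depart = ping.time
            elif ping.place == destination and depart is not None:
                return (ping.time - depart).total_seconds() / 60
        return None  # no such journey recorded


# Made-up morning: home, a stop at a cafe, then work.
pings = [
    LocationPing(datetime(2015, 2, 2, 8, 0), "home"),
    LocationPing(datetime(2015, 2, 2, 8, 20), "cafe"),
    LocationPing(datetime(2015, 2, 2, 8, 45), "work"),
]
store = PersonalDataStore(pings)
answer = store.commute_minutes()  # a single number; the cafe stop is never shared
```

The third party in this sketch receives only the journey time. The stop at the cafe, and the timestamps themselves, never leave the store.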