Unharnessing collective intelligence: A business model for privacy on mobile devices based on k-anonymity

Web 2.0 – harnessing collective intelligence. But do we want to be harnessed?

Web 2.0 has taught us the concept of harnessing collective intelligence. Companies like Google (with PageRank), Amazon (with customer reviews) and others have benefited from this idea, so the business model for the provider in harnessing collective intelligence is proven. We, i.e. the creators of the data, own the copyright to the individual data elements (for instance, customer reviews), but the providers own the value gained from harnessing that granular data. Providers of services would postulate that the granular data elements do not hold commercial value – that only the aggregated elements (the harnessed collective intelligence) have value.

Or do they?

In other words, is there any value in the granular data as opposed to the aggregated data?

Let us put this in perspective:

a) Currently, providers can (largely) deliver personalised services, some form of targeted advertising and segmentation. None of that requires customers to ‘own’ their own data.

b) This blog is not about scaremongering, conspiracy theories or even privacy per se. The risks to privacy have been explored in detail many times in relation to location-based services and the like. This discussion relates more to the anonymity of data collected from individuals.

The real question is: is there a model in which both the providers and the customers benefit if data is owned and managed by the customers themselves?

To explain this issue, we have to understand the problem of k-anonymity. The problem (and solution) of k-anonymity relates to re-identifying individuals from multiple datasets even if the data is (supposedly) anonymised. As we become creators of data with Web 2.0, and especially Mobile Web 2.0, the problem becomes significant because data is collected by providers at a phenomenal rate, and it becomes possible to re-identify people across datasets. This discussion explores the possibility of turning the problem of anonymisation into a business opportunity.

Essentially, if data is anonymised at the source and is under the control of the customer, the customer will trust the provider who anonymises their data. In return for that trust, the customer could volunteer attributes about themselves, which would enable the provider to create personalised advertising campaigns and to segment customers. This benefits both the providers (protection from legal action, personalised advertising, segmentation) and the customers (anonymised data, personalised services, etc.).

k-anonymity

k-anonymity is summarised in a paper by Latanya Sweeney (k-Anonymity: A Model for Protecting Privacy), School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA. As per the paper abstract:

Consider a data holder, such as a hospital or a bank, that has a privately held collection of person-specific, field structured data. Suppose the data holder wants to share a version of the data with researchers. How can a data holder release a version of its private data with scientific guarantees that the individuals who are the subjects of the data cannot be re-identified while the data remain practically useful? The solution provided in this paper includes a formal protection model named k-anonymity and a set of accompanying policies for deployment. A release provides k-anonymity protection if the information for each person contained in the release cannot be distinguished from at least k-1 individuals whose information also appears in the release. This paper also examines re-identification attacks that can be realized on releases that adhere to k-anonymity unless accompanying policies are respected. The k-anonymity protection model is important because it forms the basis on which the real-world systems known as Datafly, µ-Argus and k-Similar provide guarantees of privacy protection.

Keywords: data anonymity, data privacy, re-identification, data fusion, privacy.

Full paper: http://privacy.cs.cmu.edu/people/sweeney/kanonymity.pdf
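To make the definition concrete, here is a minimal sketch in Python of a k-anonymity check (the records and the is_k_anonymous helper are invented for illustration, not from the paper): a release satisfies k-anonymity when every combination of quasi-identifier values is shared by at least k rows.

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """A release is k-anonymous if every combination of quasi-identifier
    values appears in at least k rows of the release."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in counts.values())

# Hypothetical release: on the quasi-identifiers (zip, birth_date, gender)
# each record is indistinguishable from at least one other, so k=2 holds.
release = [
    {"zip": "0213*", "birth_date": "196*", "gender": "m", "diagnosis": "flu"},
    {"zip": "0213*", "birth_date": "196*", "gender": "m", "diagnosis": "asthma"},
    {"zip": "0214*", "birth_date": "197*", "gender": "f", "diagnosis": "flu"},
    {"zip": "0214*", "birth_date": "197*", "gender": "f", "diagnosis": "angina"},
]
print(is_k_anonymous(release, ["zip", "birth_date", "gender"], k=2))  # True
```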

The problem – not just location-based services, but all data

As we become creators of data with Web 2.0, and especially Mobile Web 2.0, anonymising that data becomes a problem. Historically, data has been anonymised by removing explicit identifiers such as name, address, telephone number, etc. Such data looks anonymised, but it may not be once it is correlated with another dataset which helps to uniquely identify people.

For example, as per the paper, in Massachusetts the Group Insurance Commission (GIC) is responsible for purchasing health insurance for state employees. Because the data were believed to be anonymous, GIC gave a copy of the data to researchers and sold a copy to industry. It was then possible to correlate it with ordinary voter registration data: the rightmost circle in Figure 1 shows that these data included the name, address, ZIP code, birth date and gender of each voter. This information can be linked, using ZIP code, birth date and gender, to the medical information, thereby linking diagnoses, procedures and medications to particular named individuals. For example, William Weld was governor of Massachusetts at the time and his medical records were in the GIC data. Governor Weld lived in Cambridge, Massachusetts. According to the Cambridge voter list, six people had his particular birth date; only three of them were men; and he was the only one in his 5-digit ZIP code.

[Figure 1: linking the GIC medical data to the voter list via ZIP code, birth date and gender]

source: Latanya Sweeney, k-Anonymity: A Model for Protecting Privacy
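The linkage attack is easy to sketch in code. The records below are made up (the names, dates and ZIP codes are hypothetical, not the actual GIC or voter data), but the mechanics are the same: drop the explicit identifiers, then join on the quasi-identifiers.

```python
# Naively "anonymised" medical data: name, address and phone number
# removed, but the quasi-identifiers left intact.
medical = [
    {"zip": "02138", "birth_date": "1945-07-31", "gender": "m", "diagnosis": "hypertension"},
    {"zip": "02139", "birth_date": "1972-02-14", "gender": "f", "diagnosis": "asthma"},
]

# Public voter registration list (all values invented for illustration).
voters = [
    {"name": "W. Weld",  "zip": "02138", "birth_date": "1945-07-31", "gender": "m"},
    {"name": "J. Smith", "zip": "02139", "birth_date": "1972-02-14", "gender": "f"},
    {"name": "A. Jones", "zip": "02139", "birth_date": "1972-02-14", "gender": "f"},
]

QI = ("zip", "birth_date", "gender")

def reidentify(medical, voters):
    """Join the datasets on the quasi-identifiers; a medical record whose
    QI values match exactly one voter is re-identified."""
    index = {}
    for v in voters:
        index.setdefault(tuple(v[q] for q in QI), []).append(v["name"])
    hits = []
    for m in medical:
        names = index.get(tuple(m[q] for q in QI), [])
        if len(names) == 1:  # a unique match means re-identification
            hits.append((names[0], m["diagnosis"]))
    return hits

print(reidentify(medical, voters))  # [('W. Weld', 'hypertension')]
```

Note that the second medical record matches two voters, so it stays anonymous: that ambiguity is exactly what k-anonymity tries to guarantee for every record.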

The solution can be explained with an example from another paper:

ℓ-Diversity: Privacy Beyond k-Anonymity, by Ashwin Machanavajjhala, Johannes Gehrke, Daniel Kifer and Muthuramakrishnan Venkitasubramaniam, Department of Computer Science, Cornell University. http://www.cs.cornell.edu/~vmuthu/research/ldiversity.pdf

We divide the attributes into two groups: the sensitive attributes (consisting only of medical condition) and the non-sensitive attributes (zip code, age, and nationality). An attribute is marked sensitive if an adversary must not be allowed to discover the value of that attribute for any individual in the dataset. Attributes not marked sensitive are non-sensitive. Furthermore, let the collection of attributes {zip code, age, nationality} be the quasi-identifier for this dataset. Figure 2 shows a 4-anonymous table derived from the table in Figure 1 (here “*” denotes a suppressed value so, for example, “zip code = 1485*” means that the zip code is in the range [14850−14859] and “age = 3*” means the age is in the range [30−39]). Note that in the 4-anonymous table, each tuple has the same values for the quasi-identifier as at least three other tuples in the table.

[Figure 2: a 4-anonymous table derived from the original table by generalisation and suppression]

source: ℓ-Diversity: Privacy Beyond k-Anonymity, Machanavajjhala, Gehrke, Kifer and Venkitasubramaniam, Department of Computer Science, Cornell University
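A rough sketch of how generalisation and suppression produce such a table (the records and the generalise helper are made up in the style of the paper's example; a real anonymiser would pick generalisation levels to minimise information loss while still reaching the target k):

```python
from collections import Counter

QI = ("zip", "age", "nationality")

def generalise(row):
    """Coarsen the quasi-identifiers: keep the first four digits of the
    zip code, bucket age into decades, and suppress nationality."""
    out = dict(row)
    out["zip"] = row["zip"][:4] + "*"      # 14853 -> 1485*
    out["age"] = f"{row['age'] // 10}*"    # 34    -> 3*
    out["nationality"] = "*"               # fully suppressed
    return out

# Made-up records in the style of the paper's Figure 1.
table = [
    {"zip": "14853", "age": 28, "nationality": "Russian",  "condition": "heart disease"},
    {"zip": "14853", "age": 29, "nationality": "American", "condition": "heart disease"},
    {"zip": "14850", "age": 21, "nationality": "Japanese", "condition": "viral infection"},
    {"zip": "14850", "age": 23, "nationality": "American", "condition": "viral infection"},
]

released = [generalise(row) for row in table]
counts = Counter(tuple(r[q] for q in QI) for r in released)
print(min(counts.values()))  # 4 -> every QI combination is shared by 4 rows
```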

The solution – a user-controlled policy manager which anonymises data

The solution is to have a policy manager (sketched in code after the list below) designed to:

a) Be controlled by the user i.e. user sets the policies

b) Manage all data – not just location
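A minimal sketch of what such a user-controlled policy manager might look like (the PolicyManager class and the policy names are hypothetical, invented for this post; a real implementation would sit on the device, between applications and the network):

```python
from typing import Callable

# Hypothetical per-attribute policies, set by the user, not the provider.
SUPPRESS   = lambda v: None                  # never leaves the device
GENERALISE = lambda v: str(v)[:4] + "*"      # coarsened before release
VOLUNTEER  = lambda v: v                     # shared in full, by choice

class PolicyManager:
    """Runs on the user's device and anonymises every outgoing record
    according to user-set policies, for all data, not just location."""

    def __init__(self):
        self.policies: dict[str, Callable] = {}

    def set_policy(self, attribute: str, policy: Callable) -> None:
        self.policies[attribute] = policy

    def release(self, record: dict) -> dict:
        """Apply the user's policy to each attribute; attributes with
        no policy are suppressed by default."""
        out = {}
        for attr, value in record.items():
            transformed = self.policies.get(attr, SUPPRESS)(value)
            if transformed is not None:
                out[attr] = transformed
        return out

# The user, not the provider, chooses what leaves the device.
pm = PolicyManager()
pm.set_policy("zip", GENERALISE)       # anonymised at source
pm.set_policy("interests", VOLUNTEER)  # volunteered for personalisation
print(pm.release({"zip": "02138", "name": "Alice", "interests": "jazz"}))
# {'zip': '0213*', 'interests': 'jazz'}  ('name' suppressed by default)
```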

This is a potentially win-win situation. As argued above, if data is anonymised at the source and is under the customer's control, the provider earns the customer's trust (and in turn protects them). In return for that trust, the customer could volunteer attributes about themselves, enabling personalised advertising campaigns and segmentation – to the benefit of both the providers (protection from legal action, personalised advertising, segmentation) and the customers (anonymised data, personalised services, etc.).

Note the context again: this is not about privacy per se – it is about anonymising data.

The approach also potentially provides a compelling argument for both the provider and the customer. It is different because, at the moment, advertising, segmentation and the like can only be implemented on a best-effort basis – a trust-based approach would benefit all parties.

I have called this ‘unharnessing collective intelligence’ for lack of a better phrase – but we can also think of it as ‘policy-based anonymising of data’.

Update: see also CDT to Obama: advent of “the cloud” makes privacy laws dated