Reputation: 1
I want to run an experiment on using additive noise to protect a database from inference attacks.
My database should begin with a generated list of values that have a mean of 25. I will then anonymize these values by adding a random noise value that is designed to have an expected value of 0.
For example:
I can use uniformly distributed noise in the range [-1, 1], or normal (Gaussian) noise with mean 0.
I will test this anonymization method on databases of 100, 1,000, and 10,000 values with different noise distributions.
I am not sure which platform to use, or how, so I started with 10 values in Excel. For uniformly distributed noise in [-1, 1] I use 2*RAND()-1 (RAND() alone returns values in [0, 1)) and add it to the actual value; for normal noise, I use NORM.INV(RAND(), 0, sd) with mean 0 and add the result to the actual value.
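The same experiment can be sketched outside Excel. Here is a minimal Python version of the setup described above; the spread of 5 around the mean of 25 and the Gaussian noise standard deviation of 1 are my assumptions, not values from the question:

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is reproducible

n = 10  # start small, as in the Excel test
true_values = [random.gauss(25, 5) for _ in range(n)]  # assumed spread of 5

# Uniform noise in [-1, 1] (Excel equivalent: 2*RAND()-1)
uniform_noisy = [v + random.uniform(-1, 1) for v in true_values]

# Gaussian noise, mean 0, assumed std dev 1 (Excel: NORM.INV(RAND(), 0, 1))
gauss_noisy = [v + random.gauss(0, 1) for v in true_values]

print("true mean:   ", statistics.mean(true_values))
print("uniform mean:", statistics.mean(uniform_noisy))
print("gauss mean:  ", statistics.mean(gauss_noisy))
```

Scaling `n` up to 100, 1,000, and 10,000 is then just a parameter change rather than a new spreadsheet.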
But I don't know how to interpret the data from the attacker's side. When I add noise to the dataset, how can I interpret its effect on privacy as the dataset becomes larger?
Also, should I use a database tool to handle this problem?
Upvotes: 0
Views: 446
Reputation: 11
From what I understand, you're trying to protect your "experimental" database from inference attacks.
Attackers try to extract information from a database using queries that are already allowed for public use. First, decide on your identifiers, quasi-identifiers, and sensitive values.
Consider a student management system that stores the GPA of each student. We know that GPA is sensitive information. The identifier is "student_id", and the quasi-identifiers are "standing" and, let's say, "gender". In most cases, the administrator of the RDBMS allows aggregate queries such as "Get average GPA of all students" or "Get average GPA of senior students". Attackers try to infer private values from these aggregate queries. If, somehow, there is only one senior student, then the query "Get average GPA of senior students" returns the GPA of one specific person.
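To make that leak concrete, here is a small Python sketch with made-up records (all names and values are illustrative assumptions): when the "senior" group contains a single student, the allowed aggregate query returns that student's exact GPA.

```python
# Made-up records for illustration only.
students = [
    {"student_id": 1, "standing": "senior", "gender": "F", "gpa": 3.2},
    {"student_id": 2, "standing": "junior", "gender": "M", "gpa": 2.9},
    {"student_id": 3, "standing": "junior", "gender": "F", "gpa": 3.5},
]

# Equivalent of "Get average GPA of senior students"
senior_gpas = [s["gpa"] for s in students if s["standing"] == "senior"]
avg_senior_gpa = sum(senior_gpas) / len(senior_gpas)

# Only one senior exists, so the "aggregate" is that student's exact GPA.
print(len(senior_gpas), avg_senior_gpa)
```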
There are two main ways to protect a database from these attacks: de-identification and anonymization. De-identification means removing every identifier and quasi-identifier from the database, but this does not work in some cases. Consider one student who takes a make-up exam after grades are announced. If you query the average GPA of all students before and after the exam and compare the results, you would see a small change (say, from 2.891 to 2.893). From this 0.002 difference in the aggregate, the attacker can infer how the make-up exam changed that one particular student's GPA.
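This differencing attack is plain arithmetic. A sketch in Python with the numbers above (the class size of 1000 is my assumption; the attacker only needs to know how many students there are):

```python
n_students = 1000    # assumed class size
avg_before = 2.891   # "average GPA of all students" before the make-up exam
avg_after = 2.893    # same query after the exam

# Only one record changed between the two queries, so the change in the
# total is exactly the change in that one student's GPA.
gpa_change = round(avg_after * n_students - avg_before * n_students, 3)
print(gpa_change)  # 0.002 * 1000 = 2.0
```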
The other approach is anonymization. With k-anonymity, you divide the database into groups that have at least k entities each. For example, 2-anonymity ensures that there are no groups with a single entity in them, so aggregate queries over single-entity groups no longer leak private information.
Unless you are one of the two entities in a group.
If there are 2 senior students in a class and you want the average grade of seniors, 2-anonymity allows you to have that information. But if you are one of the seniors and already know your own grade, you can infer the other student's grade.
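A sketch of that insider inference in Python (all grades here are made-up illustration values):

```python
avg_senior_gpa = 3.1   # returned by the allowed "average GPA of seniors" query
group_size = 2         # 2-anonymity: the group holds exactly two seniors
my_gpa = 3.4           # the attacker's own grade, which they already know

# sum of the group = average * size; subtract what you know about yourself.
other_gpa = round(avg_senior_gpa * group_size - my_gpa, 2)
print(other_gpa)  # 3.1 * 2 - 3.4 = 2.8
```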
Adding noise to sensitive values is a way to deal with those attacks, but if the noise is too small, it has almost no effect on the information leaked (e.g. for grades, knowing someone has 57 out of 100 instead of 58 makes almost no difference). If it is too large, it results in a loss of functionality.
You've also asked how to interpret the effect on privacy as the dataset becomes larger. Because the noise has an expected value of 0, it tends to cancel out in aggregates: if you take the average of an extremely large noisy dataset, the result converges to the true average of the sensitive values (think of the dataset as infinite, with the sensitive value taking finitely many possible values, and work out the probabilities). Adding noise with zero mean still works, but the range of the noise should get wider as the dataset gets larger to keep the same level of protection.
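This cancellation effect is easy to check numerically. A minimal sketch, assuming every sensitive value is exactly 25 and the noise is uniform in [-1, 1]:

```python
import random
import statistics

random.seed(0)  # fixed seed for reproducibility

def noisy_mean(n, half_range=1.0):
    """Average of n noisy copies of the sensitive value 25.0."""
    return statistics.mean(
        25.0 + random.uniform(-half_range, half_range) for _ in range(n)
    )

# With a fixed noise range, the aggregate's error shrinks as n grows.
errors = {n: abs(noisy_mean(n) - 25.0) for n in (100, 1000, 10000)}
for n, err in errors.items():
    print(n, err)
```

The error of the average shrinks roughly as 1/sqrt(n), which is why the noise range must grow with the dataset size to keep aggregates equally protected.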
Lastly, since you are using Excel, which is a spreadsheet rather than an RDBMS, I suggest you come up with equivalents of SQL queries: define the identifiers and quasi-identifiers of the dataset, and the public queries that anyone is allowed to perform.
Also, in addition to k-anonymity, take a look at "l-diversity" and "t-closeness" and their use in database anonymization.
I hope that answers your question. If you have further questions, please ask.
Upvotes: 1