Reputation: 502
I'm working on a project with 500,000 participants. Our database holds the precise coordinates of each participant's home, and we want to release this data to someone who needs it to evaluate how close our participants live to one another.
We are very reluctant to release the precise coordinates, because this is an anonymized project and the risk of re-identification would be very high. Rounded coordinates (to something like 100 m or 1 km) are apparently not precise enough for what they're trying to achieve.
A nice workaround would have been to send them a 500,000 by 500,000 matrix with the absolute distance between each pair of participants, but this means 250 billion entries, or rather 125 billion if we remove half the matrix since |A-B| = |B-A|.
I've never worked with this type of data before, so I was wondering if anyone had a clever idea on how to deal with this? (Something that would not involve sending them 2 TB of data!)
Thanks.
Upvotes: 1
Views: 75
Reputation: 15767
Provided that the recipient of the data is happy to perform the great-circle calculation themselves, you only need to send the 500,000 rows, but with shifted latitudes and longitudes.
First, identify an approximate geospatial centre of your dataset, and work out the offsets needed to move this centre to 0°N, 0°E. Then apply these same offsets to every participant's latitude and longitude. This will centre the results around the intersection of the equator and the prime meridian.
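The longitude offset leaves great-circle distances unchanged. The latitude offset does not: east-west distances scale with the cosine of the latitude, so the distance between the offset points A and B will only be close to the distance between the real points if your data already sits at low latitudes, and the discrepancy grows the nearer the real data is to the poles.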
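As a rough way to check this on your own data, here is a minimal Python sketch. It assumes a spherical Earth of radius 6371 km and a made-up centre of 50°N, 10°E; the sample points and the names `haversine_km` and `shift` are mine, purely for illustration. It applies the offsets described above and compares great-circle distances before and after.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: mean radius, spherical Earth model


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def shift(lat, lon, lat_offset, lon_offset):
    """Add the secret offsets; wrap longitude back into [-180, 180)."""
    new_lat = lat + lat_offset
    new_lon = (lon + lon_offset + 180.0) % 360.0 - 180.0
    return new_lat, new_lon


# Hypothetical dataset centre (made up); the offsets move it to 0N, 0E.
CENTRE = (50.0, 10.0)
OFFSETS = (-CENTRE[0], -CENTRE[1])

# Two hypothetical pairs: one separated north-south, one east-west.
a = (50.0, 10.0)
b_north = (50.1, 10.0)
b_east = (50.0, 10.1)

real_ns = haversine_km(*a, *b_north)
real_ew = haversine_km(*a, *b_east)

shifted_ns = haversine_km(*shift(*a, *OFFSETS), *shift(*b_north, *OFFSETS))
shifted_ew = haversine_km(*shift(*a, *OFFSETS), *shift(*b_east, *OFFSETS))

print(f"north-south: real {real_ns:.3f} km, shifted {shifted_ns:.3f} km")
print(f"east-west:   real {real_ew:.3f} km, shifted {shifted_ew:.3f} km")
```

In this toy example the north-south pair keeps its distance, while the east-west pair at 50°N comes out roughly 1.5 times longer after the latitude shift, so it's worth running a check like this against the precision the recipient actually needs.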
Obviously the offsets applied need to be kept secret.
This approach may not work if it is known that your data is based around a particular place - the recipient may be able to deduce where the real points are - but that is something you'll need to decide yourself.
Upvotes: 1