Reputation: 49
10 million user gps data, structure like:
userId
startGps
endGps
one user has two gps, start point and end point. if a distance of two points from different user larger than 1km. we'll defined there users are potentially close relation.
userA startGpsA endGpsA
userB startGpsB endGpsB
function relation(userGpsA A, userGpsB B)
if(distance(A.startGps , B.startGps) > 1km || distance(A.startGps , B.endGps) > 1km || distance(A.endGps , B.endGps) > 1km)
return A,B
return null
how could i find these relation fast?
Upvotes: 0
Views: 65
Reputation: 10721
One possible algorithm use spacial 'buckets' to reduce computation time. It will not do a special threading tricks, but will reduce a lot the amount of User to compare (depending of the size of the bucket).
The idea is to put in the same 'buckets' every user that are already not so far from each other, and create an index on 'buckets' that allow to get adjacent 'buckets' at low cost.
Let's assume we have
class User{
long userId;
GPS start;
GPS stop;
}
class GPS{
long x;
long y;
}
First we create a class for indexed User :
class BucketEntity implements Comparable<BucketEntity>{
User origin;
long x;
long y
}
class Bucket extends Set<BucketEntity {
}
For each User we will create two BucketEntity, one for 'start' and one for 'end'. We will store thoses BucketEntity into a specialy indexed data structure that allow easy retrival of nearest other BucketEntity.
class Index extends ConcurrentHashMap<BucketEntity,Bucket> {
// Overload the 'put' implementation to correctly manage the Bucket (null initialy, etc...)
}
All we need is to implements 'hash' (and 'equals' method in the BucketEntity class. The specification for 'hash' and 'equals' is to be the same if for two BucketEntity if they are not so far from each other. We also want to be able to compute the 'hash' function of Bucket that are spacially adjacent to another Bucket, for a given BucketEntity.
To get the correct behavior for 'hash' and 'equals' a nice/fast solution is to do 'precision reduction'. In short if you have 'x = 1248813' you replace it by 'x=124' (divide by 1000) it like changing your gps-meter precision to a gps-kilometer precision.
public static long scall = 1000;
boolean equals(BucketEntity that)
{
if (this == that) return true;
if (this.x / scall == that.x / scall &&
this.y / scall == that.y / scall)
return true;
return false;
}
// Maybe an 'int' is not enough to correctly hash your data
// if so you have to create you own implementation of Map
// with a special "long hashCode()" support.
int hashCode()
{
// We put the 'x' bits in the high level, and the 'y' bits in the low level.
// So the 'x' and 'y' don't conflict.
// Take extra-care of the value of 'scall' relatively to your data and the max value of 'int'. scall == 10000 should be a maximum.
return (this.x / scall) * scall + (this.y / scall);
}
As you can see in the hashCode() method, Bucket that are close to each other have really near hashCode(), if I give you a Bucket, you can also compute the spacially adjacents Bucket hashCode().
Now you can get BucketEntity(ies) that are in the same Bucket as your given BucketEntity. To get the adjacent bucket you need to create 9 virtual BucketEntity to 'get()' Bucket/null that are around the Bucket of your BucketEntity.
List<BucketEntity> shortListToCheck = // A List not a Set !
shortListToCheck.addindex.get(new BucketEntity(user, (x / scall)+1 , (y/scall)+1 )));
shortListToCheck.addindex.get(new BucketEntity(user, (x / scall)+1 , (y/scall) )));
shortListToCheck.addindex.get(new BucketEntity(user, (x / scall)+1 , (y/scall)-1 )));
shortListToCheck.addindex.get(new BucketEntity(user, (x / scall)+1 , (y/scall)+1 )));
shortListToCheck.addindex.get(new BucketEntity(user, (x / scall) , (y/scall) )));
shortListToCheck.addindex.get(new BucketEntity(user, (x / scall)-1 , (y/scall)-1 )));
shortListToCheck.addindex.get(new BucketEntity(user, (x / scall)-1 , (y/scall)+1 )));
shortListToCheck.addindex.get(new BucketEntity(user, (x / scall)-1 , (y/scall) )));
shortListToCheck.addindex.get(new BucketEntity(user, (x / scall)-1 , (y/scall)-1 )));
get() all Buckets that match the 9 virtual BucketEntry(can be null). For each of the User of the given 9 Buckets, really compute the distance by the way you provide in your question.
Then play with the 'scall'. Has you can see, there is no real constraint on multi-threading here. Maybe the next level of algorithm optimization is the adaptative/recursive bucket-size based on an adaptative scaling-size.
Upvotes: 2