Reputation: 1013
We took the average of multiple geography points and assigned that as an item's location. I now need to trace through this data and identify any of the original points that are too far away from the rest. I reckon I'll need to use STDistance, but I'm concerned that to do it right I'll need 2 cursors to loop through and compare every point against every other.
In case an example helps clarify: we needed to lock in the GPS coordinates of an item, so we asked the user to press a button whenever they were in the approximate area of the item, and the GPS co-ords were logged. We then took 5 of those and set the item's location to AVG(Location.Lat) + AVG(Location.Long). The issue arises when one of those button presses was made miles away, throwing the average out, so I now need to identify any of these.
Any ideas on the correct/efficient way to do this in SQL only? (I'm dealing with millions of entries, so I'm worried that looping through 2 cursors per item will cripple the database.)
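For reference, this is roughly the shape of the check I have in mind: flag any logged press whose distance from its item's averaged location exceeds some threshold. The ItemPress table and its ItemId/Location columns are just placeholder names for our real schema, and the threshold is arbitrary:

-- Placeholder schema: ItemPress(ItemId Int, Location Geography)
Declare @Threshold Float = 500; -- metres; whatever counts as "too far"

With ItemCentre As
(
    -- Average the raw lat/long per item to get the item's assigned location
    Select ItemId,
           Avg(Location.Lat)  As AvgLat,
           Avg(Location.Long) As AvgLong
    From   ItemPress
    Group By ItemId
)
Select P.ItemId,
       P.Location.STDistance(geography::Point(C.AvgLat, C.AvgLong, 4326)) As MetresFromAverage
From   ItemPress P
Inner Join ItemCentre C On C.ItemId = P.ItemId
Where  P.Location.STDistance(geography::Point(C.AvgLat, C.AvgLong, 4326)) > @Threshold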
Upvotes: 0
Views: 809
Reputation: 24498
Throwing away statistical outliers is easy for a human but can be a bit troublesome for a computer. This is especially true in your situation because you are dealing with two dimensions (latitude and longitude).
I would suggest that you take a look at this blog post I wrote several years ago: Calculating Mean Median and Mode With SQL Server.
With latitude and longitude, each successive digit after the decimal point represents a smaller distance on the ground. What you could do is round the latitude and longitude to a certain number of decimal places and find the mode of the rounded values. Remove the points that do not match the mode, and then average the remaining unrounded values.
Since you are working in two dimensions, you'll need to do this separately for the latitude and longitude values, because the latitude could be way off while the longitude is not (this would represent a bad point that is north or south of the actual location). Likewise, the longitude could be way off while the latitude looks fine. If either value is "bad", you should discard the point completely.
Here is an example of what I am talking about:
Declare @Temp Table(Lat Decimal(9,6), Lon Decimal(9,6))

Insert Into @Temp Values(20.12341, 10.98731)
Insert Into @Temp Values(20.12342, 10.98732)
Insert Into @Temp Values(20.12343, 10.98733)
Insert Into @Temp Values(20.12344, 10.98734)
Insert Into @Temp Values(20.12344, 10.68734) -- Latitude OK, Longitude bad
Insert Into @Temp Values(20.32344, 10.98734) -- Longitude OK, Latitude bad
Insert Into @Temp Values(20.42340, 10.68730) -- Both are bad

-- Simple average of every point, which the bad readings skew
Select Avg(Lat), Avg(Lon)
From   @Temp

-- Average of only the points whose rounded latitude AND longitude
-- both match the mode (the most common rounded value)
Select Avg(T.Lat) As Latitude,
       Avg(T.Lon) As Longitude
From   @Temp T
Inner Join (
        -- Calculate the mode for the latitude
        Select Top 1 Convert(Decimal(9,4), Lat) As ModeOfLat
        From @Temp
        Group By Convert(Decimal(9,4), Lat)
        Order By Count(*) DESC
    ) As Latitudes
    On Convert(Decimal(9,4), T.Lat) = Latitudes.ModeOfLat
Inner Join (
        -- Calculate the mode for the longitude
        Select Top 1 Convert(Decimal(9,4), Lon) As ModeOfLon
        From @Temp
        Group By Convert(Decimal(9,4), Lon)
        Order By Count(*) DESC
    ) As Longitudes
    On Convert(Decimal(9,4), T.Lon) = Longitudes.ModeOfLon
If you run the query above in a SQL Server Management Studio window, you will see that the simple average is considerably different from the result of the mode+average method.
Since this is a set-based method, it should be considerably faster than a looping/cursor approach.
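To apply the same idea per item across millions of rows, you can replace the Top 1 subqueries with a ranking of each item's rounded values. Here is a rough sketch; the ItemPress table and its ItemId/Lat/Lon columns are placeholder names for whatever your logging table actually looks like:

-- Placeholder schema: ItemPress(ItemId Int, Lat Decimal(9,6), Lon Decimal(9,6))
-- Rank each item's rounded latitudes and longitudes by how often they occur,
-- keep only the points matching the most common value for BOTH, then average.
With LatMode As
(
    Select ItemId,
           Convert(Decimal(9,4), Lat) As RoundedLat,
           Row_Number() Over (Partition By ItemId Order By Count(*) DESC) As rn
    From   ItemPress
    Group By ItemId, Convert(Decimal(9,4), Lat)
),
LonMode As
(
    Select ItemId,
           Convert(Decimal(9,4), Lon) As RoundedLon,
           Row_Number() Over (Partition By ItemId Order By Count(*) DESC) As rn
    From   ItemPress
    Group By ItemId, Convert(Decimal(9,4), Lon)
)
Select P.ItemId,
       Avg(P.Lat) As Latitude,
       Avg(P.Lon) As Longitude
From   ItemPress P
Inner Join LatMode LA
    On  LA.ItemId = P.ItemId
    And LA.rn = 1
    And Convert(Decimal(9,4), P.Lat) = LA.RoundedLat
Inner Join LonMode LO
    On  LO.ItemId = P.ItemId
    And LO.rn = 1
    And Convert(Decimal(9,4), P.Lon) = LO.RoundedLon
Group By P.ItemId

Because it is all joins and grouping, this processes every item in one pass rather than looping item by item.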
Upvotes: 2