Anthony Calonne

Reputation: 1

What is the quickest way to calculate the number of points within a radius for each row of a large dataframe?

I have around 4 million rows of data, collected mainly around big cities and tourist spots. The problem is that I want to be able to compare data from undersampled locations with data from oversampled locations, so I need to know the sampling density around each sample. To do so, I want to count how many samples were taken within a 30 km radius of each sample.

I planned to divide the area covered (France) into two grids, the first one with cell boundaries on even latitudes and longitudes and the second one on odd latitudes and longitudes, and I added columns to the dataframe giving, for each row, its cell coordinates in both grids. The goal was to reduce the number of comparisons: distances are only calculated between rows in the same cell. I used two grids so that a sample near a cell border in one grid would be near the middle of a cell in the other grid, so for each row the higher of the two counts should be close to the real number of samples around it. However, my computer still can't run it because it takes too long. Should I use finer grids, or is there a better way to optimise this calculation?
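To make the idea more concrete, here is a minimal sketch of the approach (assuming the coordinates are in degrees in columns named lat and lon, and using great-circle distances):

import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    # great-circle distance in km, all inputs in degrees (vectorised)
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def counts_for_grid(df, offset):
    # assign each row to a 2 x 2 degree cell, shifted by `offset` degrees
    cell = (np.floor((df["lat"] - offset) / 2).astype(int).astype(str) + "_"
            + np.floor((df["lon"] - offset) / 2).astype(int).astype(str))
    counts = pd.Series(0, index=df.index)
    for _, sub in df.groupby(cell):
        lat = sub["lat"].to_numpy()
        lon = sub["lon"].to_numpy()
        # pairwise distances within the cell only (dense cells can still be very big)
        d = haversine_km(lat[:, None], lon[:, None], lat[None, :], lon[None, :])
        counts.loc[sub.index] = (d <= 30.0).sum(axis=1) - 1  # exclude the point itself
    return counts

# for each row, keep the higher count of the two offset grids
df["n_within_30km"] = np.maximum(counts_for_grid(df, 0), counts_for_grid(df, 1))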

Upvotes: 0

Views: 45

Answers (1)

Ian Turton

Reputation: 10976

You should load your data into PostGIS and use the ST_DWithin operator. Make sure to add a spatial index to the table, and PostGIS will then only check the nearby samples.

You should end up with a query something like:

SELECT t.id, COUNT(s.id)
FROM   <table> AS s, <table> AS t
WHERE  t.id <> s.id AND ST_DWithin(t.geom, s.geom, <distance>)
GROUP BY t.id
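The spatial index is a single statement, something like:

CREATE INDEX <table>_geom_idx ON <table> USING GIST (geom);

Note that with a plain geometry column the distance passed to ST_DWithin is in the units of the coordinate system, so for a 30 km radius you would want either a geography column (metres) or a metric projection.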

Upvotes: 0
