Anthony Calonne

Reputation: 1

What is the quickest way to calculate the number of points within a radius for each row of a large dataframe?

I have around 4 million rows of data, collected mainly around big cities and tourist spots. The problem is that I want to be able to compare data from undersampled locations with data from oversampled locations, so I need to know the sampling density around each sample. To do so, I want to count how many samples were taken within a 30 km radius of each sample.

I planned to divide the area covered (France) into two grids, the first one with cell boundaries on even latitudes and longitudes and the second one on odd latitudes and longitudes, and I added columns to the dataframe giving, for each row, its cell coordinates in both grids. The goal was to reduce the number of comparisons: distances are only calculated between rows in the same cell. I used two grids so that a sample near a cell border in one grid would be near the middle of a cell in the other grid, so for each row the higher of the two counts should be close to the real number of samples around it. However, my computer still can't run it because it takes too long. Should I use finer grids, or is there a better way to optimise this calculation?
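To make the idea more concrete, here is a minimal sketch of the approach (assuming the coordinates are in degrees in columns named lat and lon, and using great-circle distances):

import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    # great-circle distance in km, all inputs in degrees (vectorised)
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def counts_for_grid(df, offset):
    # assign each row to a 2 x 2 degree cell, shifted by `offset` degrees
    cell = (np.floor((df["lat"] - offset) / 2).astype(int).astype(str) + "_"
            + np.floor((df["lon"] - offset) / 2).astype(int).astype(str))
    counts = pd.Series(0, index=df.index)
    for _, sub in df.groupby(cell):
        lat = sub["lat"].to_numpy()
        lon = sub["lon"].to_numpy()
        # pairwise distances within the cell only (dense cells can still be very big)
        d = haversine_km(lat[:, None], lon[:, None], lat[None, :], lon[None, :])
        counts.loc[sub.index] = (d <= 30.0).sum(axis=1) - 1  # exclude the point itself
    return counts

# for each row, keep the higher count of the two offset grids
df["n_within_30km"] = np.maximum(counts_for_grid(df, 0), counts_for_grid(df, 1))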

Upvotes: 0

Views: 45

Answers (1)

Ian Turton

Reputation: 10976

You should load your data into PostGIS and use the ST_DWithin operator. Make sure to add a spatial index to the table, and PostGIS will then only check the nearby samples.

You should end up with a query something like:

SELECT t.id, COUNT(s.id)
FROM   <table> AS s, <table> AS t
WHERE  t.id <> s.id AND ST_DWithin(t.geom, s.geom, <distance>)
GROUP BY t.id
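The spatial index is a single statement, something like:

CREATE INDEX <table>_geom_idx ON <table> USING GIST (geom);

Note that with a plain geometry column the distance passed to ST_DWithin is in the units of the coordinate system, so for a 30 km radius you would want either a geography column (metres) or a metric projection.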

Upvotes: 0
