Reputation: 25
I have a dataframe with a large number of lat/lon points (305000). I want to reduce the size of my dataframe by taking a sample on each iteration and calculating the haversine distance between each pair of consecutive rows. If the distance is too small, I want to delete one of the two points. How can I do this in Python? I wanted to use shift() but I don't know the right way to use it. This is what I am trying to do:
rows = random.sample(list(df.index), 50)
for i in range(50):
    rows = np.random.choice(df.index.values, 1000)
    sampled_df = df.loc[rows]  # .loc replaces the deprecated .ix
    if haversine(sampled_df, sampled_df.shift()) < e:
        # delete one row
Upvotes: 1
Views: 212
Reputation: 7806
The big questions are "Why would you want to do that?" and "What would it gain you once you are finished (besides speed)?" The problem with your approach is deciding which of the two or more nearby points to delete. The answer to how to approach this lies in those big questions. I would suggest one of a few approaches, depending on what you need: do you want to be left with a center point, or with a representative point?
A few implementation suggestions: use a groupby or a mask instead of deleting data. For speed, try to avoid for loops in pandas.
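As a rough illustration of the mask suggestion, here is a minimal sketch that computes the haversine distance between each row and the previous one with shift(), without a for loop. The column names "lat" and "lon", the helper names haversine_km and keep_mask, the kilometre unit and the sample coordinates are all assumptions made for this example, not something taken from the question.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; the threshold is assumed to be in km


def haversine_km(lat1, lon1, lat2, lon2):
    """Vectorized great-circle distance in kilometres between paired points."""
    lat1, lon1, lat2, lon2 = (
        np.radians(np.asarray(v, dtype=float)) for v in (lat1, lon1, lat2, lon2)
    )
    a = (
        np.sin((lat2 - lat1) / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    )
    # Clip guards against tiny floating-point overshoot outside [0, 1].
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))


def keep_mask(df, min_km, lat_col="lat", lon_col="lon"):
    """Boolean Series that is False where a row is closer than min_km to the row before it."""
    dist = haversine_km(
        df[lat_col].shift(), df[lon_col].shift(), df[lat_col], df[lon_col]
    )
    # The first row has no predecessor, so its distance is NaN and it is kept.
    return ~pd.Series(dist, index=df.index).lt(min_km)


# Hypothetical sample data (assumed column names).
df = pd.DataFrame(
    {
        "lat": [48.8566, 48.8567, 51.5074, 51.5074],
        "lon": [2.3522, 2.3523, -0.1278, -0.1278],
    }
)
mask = keep_mask(df, min_km=1.0)
reduced = df[mask]  # df itself is left untouched
```

Note that each row is compared with its original predecessor, so a long chain of closely spaced points collapses to its first point. Whether that is acceptable depends on the answer to the "center point or representative point" question above.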
Upvotes: -1
Reputation: 2762
How about using a masked array and setting the mask value to true for each point you remove?
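A minimal sketch of that idea with numpy.ma, assuming the points are stored as an array of (lat, lon) rows and that some distance check has already produced the too_close flags. Both the coordinates and the flags here are made up for illustration.

```python
import numpy as np

# Hypothetical (lat, lon) pairs in degrees.
coords = np.array(
    [
        [48.8566, 2.3522],
        [48.8567, 2.3523],
        [51.5074, -0.1278],
    ]
)

# Assumed result of a distance check: row 1 is too close to row 0.
too_close = np.array([False, True, False])

# Wrap the data in a masked array and mask whole rows instead of deleting them.
masked = np.ma.masked_array(coords)
masked[too_close] = np.ma.masked

# Reductions skip masked entries.
center = masked.mean(axis=0)

# When a plain array of the surviving rows is needed, drop the masked rows.
kept = np.ma.compress_rows(masked)
```

The underlying values in coords are not modified, so a point can be brought back later by clearing its entries in the mask.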
Upvotes: 1