Reputation: 21
I'm trying to change the first observation for each unique ID in a dataframe to NaN. I'm working with a dataframe of timestamps and coordinate points that is already sorted by unique ID and timestamp.
Sample:
ID timestamp latitude longitude
0 1 6/9/2017 11:20 38.795333 77.008883
1 1 6/9/2017 13:10 38.889011 77.050061
2 1 6/9/2017 16:23 40.748249 73.984191
3 2 6/11/2017 08:35 38.920602 77.222329
4 2 6/11/2017 10:00 42.366211 71.020943
5 2 6/11/2017 20:00 38.897416 77.036833
6 2 6/12/2017 07:30 38.851426 77.042298
7 2 6/12/2017 10:20 38.917346 77.222553
8 3 6/11/2017 09:01 40.782869 73.967544
9 3 6/11/2017 10:03 38.954268 77.449695
10 3 6/11/2017 11:48 38.872875 77.007763
11 3 6/12/2017 11:52 40.776931 73.876155
Attempt:
df['latitude'] =\
df.groupby('ID')['latitude'].apply(lambda x: x[0].np.nan)
df['longitude'] =\
df.groupby('ID')['longitude'].apply(lambda x: x[0].np.nan)
Which gives the error:
KeyError: 0
I suspected that although the df was already sorted, I would still need to use a groupby to operate on each unique ID. I'm having trouble figuring out how to access each first value and then replace it with NaN.
Here's the desired output:
ID timestamp latitude longitude
0 1 6/9/2017 11:20 NaN NaN
1 1 6/9/2017 13:10 38.889011 77.050061
2 1 6/9/2017 16:23 40.748249 73.984191
3 2 6/11/2017 08:35 NaN NaN
4 2 6/11/2017 10:00 42.366211 71.020943
5 2 6/11/2017 20:00 38.897416 77.036833
6 2 6/12/2017 07:30 38.851426 77.042298
7 2 6/12/2017 10:20 38.917346 77.222553
8 3 6/11/2017 09:01 NaN NaN
9 3 6/11/2017 10:03 38.954268 77.449695
10 3 6/11/2017 11:48 38.872875 77.007763
11 3 6/12/2017 11:52 40.776931 73.876155
Edit (Why do this?):
I'm trying to adapt a version of this answer to calculate distance and velocity. Everything works great except that the first lat/lon values for each ID are wrong, because the function calculates across the rows regardless of ID. Looking at different solutions, I suspect I need to do something similar to this... calculating both the velocity and distance by using a concat and shift. This is kind of difficult for me to conceptualize, though, so I figured just replacing those values would be simpler than editing and rerunning, which is why I posed the question.
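For what it's worth, here is a minimal sketch of how I understand the shift idea (groupby + shift; the distance function itself is omitted, and the prev_lat/prev_lon column names are just placeholders):
import pandas as pd

# Sketch only: shifting within each ID group pairs every row with the
# previous observation for the same ID. The first row of each group has
# no predecessor, so prev_lat/prev_lon come out as NaN automatically.
df['prev_lat'] = df.groupby('ID')['latitude'].shift()
df['prev_lon'] = df.groupby('ID')['longitude'].shift()
# A distance/velocity function applied to (prev_lat, prev_lon, latitude,
# longitude) would then yield NaN for each ID's first row.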
Upvotes: 1
Views: 54
Reputation: 1426
Edit:
Stealing a bit from Moormanly, one can turn it into a one-liner:
df.loc[df.groupby('ID').head(1).index,
['longitude', 'latitude']] = float('nan')
This should do the trick:
indices = df.groupby('ID').head(1).index
df.loc[indices, 'latitude'] = float('nan')
df.loc[indices, 'longitude'] = float('nan')
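For the sample data, df.groupby('ID').head(1) keeps the first row of each ID group, so its index picks out exactly the rows to overwrite (output illustrative):
print(list(indices))  # [0, 3, 8]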
Result:
ID timestamp latitude longitude
index
0 1 6/9/2017 11:20 NaN NaN
1 1 6/9/2017 13:10 38.889011 77.050061
2 1 6/9/2017 16:23 40.748249 73.984191
3 2 6/11/2017 08:35 NaN NaN
4 2 6/11/2017 10:00 42.366211 71.020943
5 2 6/11/2017 20:00 38.897416 77.036833
6 2 6/12/2017 07:30 38.851426 77.042298
7 2 6/12/2017 10:20 38.917346 77.222553
8 3 6/11/2017 09:01 NaN NaN
9 3 6/11/2017 10:03 38.954268 77.449695
10 3 6/11/2017 11:48 38.872875 77.007763
11 3 6/12/2017 11:52 40.776931 73.876155
Upvotes: 0
Reputation: 1438
Since your df is already sorted by the ID column, you can use the following trick to get a boolean mask marking the first occurrence of each unique ID:
mask = df.ID != df.ID.shift()
Then set the corresponding data to NaN:
import numpy as np

df.loc[mask, ['latitude', 'longitude']] = np.nan
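Note that this relies on the frame being sorted by ID. If it were not, an order-independent variant (my addition, not the trick above) would use groupby().cumcount():
# cumcount() numbers the rows within each ID group starting at 0,
# so == 0 selects each group's first row even if IDs are interleaved.
df.loc[df.groupby('ID').cumcount() == 0, ['latitude', 'longitude']] = np.nan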
Upvotes: 1