champezius

Reputation: 21

NaN at First Position of Two Columns, By Each Unique Value

I'm trying to change the first observation for each unique ID in a dataframe to an NaN. I'm working with a dataframe of timestamps and coordinate points that are already sorted by unique ID and timestamp.

Sample:

    ID        timestamp   latitude  longitude
0    1   6/9/2017 11:20  38.795333  77.008883
1    1   6/9/2017 13:10  38.889011  77.050061
2    1   6/9/2017 16:23  40.748249  73.984191
3    2  6/11/2017 08:35  38.920602  77.222329
4    2  6/11/2017 10:00  42.366211  71.020943
5    2  6/11/2017 20:00  38.897416  77.036833
6    2  6/12/2017 07:30  38.851426  77.042298
7    2  6/12/2017 10:20  38.917346  77.222553
8    3  6/11/2017 09:01  40.782869  73.967544
9    3  6/11/2017 10:03  38.954268  77.449695
10   3  6/11/2017 11:48  38.872875  77.007763
11   3  6/12/2017 11:52  40.776931  73.876155

Attempt:

df['latitude'] =\
 df.groupby('ID')['latitude'].apply(lambda x: x[0].np.nan)
df['longitude'] =\
 df.groupby('ID')['longitude'].apply(lambda x: x[0].np.nan)

I suspected that although the df was already sorted, I would still need a groupby to operate on each unique ID. I'm having trouble working out how to access each group's first value and then replace it with NaN.

Running the above gives the error: KeyError: 0

Here's the desired output:

        ID        timestamp   latitude  longitude
    0    1   6/9/2017 11:20        NaN       NaN
    1    1   6/9/2017 13:10  38.889011  77.050061
    2    1   6/9/2017 16:23  40.748249  73.984191
    3    2  6/11/2017 08:35        NaN       NaN
    4    2  6/11/2017 10:00  42.366211  71.020943
    5    2  6/11/2017 20:00  38.897416  77.036833
    6    2  6/12/2017 07:30  38.851426  77.042298
    7    2  6/12/2017 10:20  38.917346  77.222553
    8    3  6/11/2017 09:01        NaN       NaN
    9    3  6/11/2017 10:03  38.954268  77.449695
    10   3  6/11/2017 11:48  38.872875  77.007763
    11   3  6/12/2017 11:52  40.776931  73.876155

Edit (Why do this?):

I'm trying to adapt a version of this answer to calculate distance and velocity. Everything works great except that the first latitude/longitude values for each ID are wrong, because the function calculates across rows indiscriminately of the ID. Looking at a different solution, I suspect I need to do something similar... calculating both the velocity and distance using a concat and shift. That's difficult for me to conceptualize, though, so I figured just replacing those first values would be simpler than editing and rerunning - which is why I posed the question.
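For context, a minimal sketch of that shift idea (using `groupby(...).shift()` rather than concat, and a small invented subset of the sample data): shifting within each group already produces NaN for each group's first row, so per-row differences never cross an ID boundary.

```python
import pandas as pd

# Invented subset of the sample data, for illustration only
df = pd.DataFrame({
    'ID': [1, 1, 2, 2, 2],
    'latitude': [38.7953, 38.8890, 38.9206, 42.3662, 38.8974],
    'longitude': [77.0089, 77.0501, 77.2223, 71.0209, 77.0368],
})

# Shift within each ID group: the first row of every group becomes NaN
prev = df.groupby('ID')[['latitude', 'longitude']].shift()
df['dlat'] = df['latitude'] - prev['latitude']
df['dlon'] = df['longitude'] - prev['longitude']
```

Here `dlat`/`dlon` are NaN exactly on each group's first row, which is the same set of rows the question wants to blank out.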

Upvotes: 1

Views: 54

Answers (2)

sobek

Reputation: 1426

Edit:

Borrowing a bit from Moormanly's answer, this can be turned into a one-liner:

df.loc[df.groupby('ID').head(1).index,
       ['longitude', 'latitude']] = float('nan')

This should do the trick:

indices = df.groupby('ID').head(1).index
df.loc[indices, 'latitude'] = float('nan')
df.loc[indices, 'longitude'] = float('nan')

Result:

       ID        timestamp   latitude  longitude
index                                           
0       1   6/9/2017 11:20        NaN        NaN
1       1   6/9/2017 13:10  38.889011  77.050061
2       1   6/9/2017 16:23  40.748249  73.984191
3       2  6/11/2017 08:35        NaN        NaN
4       2  6/11/2017 10:00  42.366211  71.020943
5       2  6/11/2017 20:00  38.897416  77.036833
6       2  6/12/2017 07:30  38.851426  77.042298
7       2  6/12/2017 10:20  38.917346  77.222553
8       3  6/11/2017 09:01        NaN        NaN
9       3  6/11/2017 10:03  38.954268  77.449695
10      3  6/11/2017 11:48  38.872875  77.007763
11      3  6/12/2017 11:52  40.776931  73.876155

Upvotes: 0

Moormanly

Reputation: 1438

Since your df is already sorted by the ID column, you can use the following trick to get the first occurrence of each unique ID as a boolean mask:

mask = df.ID != df.ID.shift()

Then set the corresponding data to NaN (assuming numpy is imported as np):

df.loc[mask, ['latitude', 'longitude']] = np.nan
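Put together as a self-contained snippet (with the imports made explicit and a few rows of the sample data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 1, 2, 2, 3, 3],
    'latitude': [38.795333, 38.889011, 40.748249, 38.920602,
                 42.366211, 40.782869, 38.954268],
    'longitude': [77.008883, 77.050061, 73.984191, 77.222329,
                  71.020943, 73.967544, 77.449695],
})

# True wherever ID differs from the row above, i.e. at each group's first row
mask = df.ID != df.ID.shift()
df.loc[mask, ['latitude', 'longitude']] = np.nan
```

Note this relies on the frame being sorted by ID, as stated in the question; on unsorted data it would blank every row where the ID changes, not just each ID's first occurrence.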

Upvotes: 1
