schwim

Reputation: 103

Update column values in a group based on one row in that group

I have a dataframe from source data that resembles the following:

In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: df = pd.DataFrame({'test_group': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                           'test_type': [np.nan, 'memory', np.nan, np.nan, 'visual', np.nan,
                                         np.nan, 'auditory', np.nan]})
In [4]: df
Out[4]:
   test_group test_type
0           1       NaN
1           1    memory
2           1       NaN
3           2       NaN
4           2    visual
5           2       NaN
6           3       NaN
7           3  auditory
8           3       NaN

test_group identifies the grouping of the rows; each group represents one test. I need to replace the NaNs in column test_type in each test_group with the group's single non-NaN value, e.g. memory, visual, etc.

I've tried a variety of approaches including isolating the "real" value in test_type such as

In [4]: df.groupby('test_group')['test_type'].unique()
Out[4]:
test_group
1      [nan, memory]
2      [nan, visual]
3    [nan, auditory]

Easy enough, I can index into each row and pluck out the value I want. This seems to head in the right direction:

In [6]: df.groupby('test_group')['test_type'].unique().apply(lambda x: x[1])
Out[6]:
test_group
1      memory
2      visual
3    auditory

I tried this among many other things but it doesn't quite work (note: apply and transform give the same result):

In [15]: grp = df.groupby('test_group')
In [16]: df['test_type'] = grp['test_type'].unique().transform(lambda x: x[1])

In [17]: df
Out[17]:
   test_group test_type
0           1       NaN
1           1    memory
2           1    visual
3           2  auditory
4           2       NaN
5           2       NaN
6           3       NaN
7           3       NaN
8           3       NaN
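Looking more closely at the result, the grouped series is indexed by test_group (1, 2, 3), and the assignment aligns those index labels against the frame's row labels 0 through 8, which would explain why only rows 1, 2 and 3 receive values. A minimal sketch of that alignment:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'test_group': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'test_type': [np.nan, 'memory', np.nan, np.nan, 'visual', np.nan,
                                 np.nan, 'auditory', np.nan]})

# The grouped result is indexed by test_group, not by row position.
s = df.groupby('test_group')['test_type'].unique().apply(lambda x: x[1])
print(s.index.tolist())  # [1, 2, 3]

# Assignment aligns on index labels, so only row labels 1, 2 and 3
# match; every other row becomes NaN.
df['test_type'] = s
```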

I'm sure if I looped it I'd be done with things, but loops are too slow as the data set is millions of records per file.

Upvotes: 2

Views: 730

Answers (2)

timgeb

Reputation: 78650

Under the assumption that there's a unique non-NaN value per group, the following should do what you want.

>>> df['test_type'] = df.groupby('test_group')['test_type'].ffill().bfill() 
>>> df
   test_group test_type
0           1    memory
1           1    memory
2           1    memory
3           2    visual
4           2    visual
5           2    visual
6           3  auditory
7           3  auditory
8           3  auditory

edit:

The original answer used

df.groupby('test_group')['test_type'].fillna(method='ffill').fillna(method='bfill') 

but according to schwim's timings, ffill/bfill is significantly faster (for some reason).
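A variation on the same idea (still assuming one non-NaN value per group, and not part of the accepted approach above) is GroupBy.transform('first'), which broadcasts each group's first non-null value back to every row of the group in a single pass:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'test_group': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'test_type': [np.nan, 'memory', np.nan, np.nan, 'visual', np.nan,
                                 np.nan, 'auditory', np.nan]})

# 'first' skips NaN, so each group's single real value is broadcast
# back to all rows of that group.
df['test_type'] = df.groupby('test_group')['test_type'].transform('first')
```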

Upvotes: 1

Ch3steR

Reputation: 20669

You can use GroupBy.size to get the size of each group, boolean-index the non-NaN rows using Series.isna, then use Index.repeat with df.reindex to broadcast each value across its group.

repeats = df.groupby('test_group').size()
out = df[~df['test_type'].isna()]
out.reindex(out.index.repeat(repeats)).reset_index(drop=True)

   test_group test_type
0           1    memory
1           1    memory
2           1    memory
3           2    visual
4           2    visual
5           2    visual
6           3  auditory
7           3  auditory
8           3  auditory
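Step by step, on the example frame (note this relies on each group having exactly one non-NaN row, and on the groups appearing in the same sorted order that GroupBy.size returns them in):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'test_group': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'test_type': [np.nan, 'memory', np.nan, np.nan, 'visual', np.nan,
                                 np.nan, 'auditory', np.nan]})

# One row count per group, in sorted group order: 1 -> 3, 2 -> 3, 3 -> 3.
repeats = df.groupby('test_group').size()

# Keep only the rows that carry a real value (row labels 1, 4, 7 here).
out = df[~df['test_type'].isna()]

# Repeat each kept row's label by its group's size, then reindex:
# [1, 1, 1, 4, 4, 4, 7, 7, 7] -> each value duplicated across its group.
result = out.reindex(out.index.repeat(repeats)).reset_index(drop=True)
```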

timeit analysis:

Benchmarking dataframe:

df = pd.DataFrame({'test_group': [1]*10_001 + [2]*10_001 + [3]*10_001, 
                            'test_type' : [np.nan]*10_000 + ['memory'] +
                                          [np.nan]*10_000 + ['visual'] +
                                          [np.nan]*10_000 + ['auditory']})   
df.shape
# (30003, 2) 

Results:

# Ch3steR's answer
In [54]: %%timeit 
    ...: repeats = df.groupby('test_group').size() 
    ...: out = df[~df['test_type'].isna()] 
    ...: out.reindex(out.index.repeat(repeats)).reset_index(drop=True) 
    ...:  
    ...:                                                                        
2.56 ms ± 73.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# timgeb's answer
In [55]: %%timeit 
    ...: df['test_type'] = df.groupby('test_group')['test_type'].fillna(method='ffill').fillna(method='bfill') 
    ...:  
    ...:                                                                                                                 
10.1 ms ± 724 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Roughly 4x faster. I believe it's because boolean indexing is very fast, and reindex + repeat is lightweight compared to the two fillna calls.

Upvotes: 2
