Reputation: 103
I have a dataframe from source data that resembles the following:
In [1]: import pandas as pd, numpy as np
In [2]: df = pd.DataFrame({'test_group': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                           'test_type': [np.nan, 'memory', np.nan, np.nan, 'visual', np.nan,
                                         np.nan, 'auditory', np.nan]})
In [3]: df
Out[3]:
test_group test_type
0 1 NaN
1 1 memory
2 1 NaN
3 2 NaN
4 2 visual
5 2 NaN
6 3 NaN
7 3 auditory
8 3 NaN
test_group represents the grouping of the rows, where each group represents one test. I need to replace the NaNs in column test_type within each test_group with the value of the row that is not NaN, e.g. memory, visual, etc.
I've tried a variety of approaches, including isolating the "real" value in test_type, such as:
In [4]: df.groupby('test_group')['test_type'].unique()
Out[4]:
test_group
1 [nan, memory]
2 [nan, visual]
3 [nan, auditory]
Easy enough, I can index into each row and pluck out the value I want. This seems to head in the right direction:
In [6]: df.groupby('test_group')['test_type'].unique().apply(lambda x: x[1])
Out[6]:
test_group
1 memory
2 visual
3 auditory
I tried this among many other things but it doesn't quite work (note: apply and transform give the same result):
In [15]: grp = df.groupby('test_group')
In [16]: df['test_type'] = grp['test_type'].unique().transform(lambda x: x[1])
In [17]: df
Out[17]:
test_group test_type
0 1 NaN
1 1 memory
2 1 visual
3 2 auditory
4 2 NaN
5 2 NaN
6 3 NaN
7 3 NaN
8 3 NaN
I'm sure I could get there by looping, but loops are too slow as the data set is millions of records per file.
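(I suspect the transform attempt above fails because unique() collapses each group to a single row, so the result no longer aligns with df's index when it's assigned back.) For reference, the kind of row-wise loop I'd rather avoid looks roughly like this, a simplified sketch that assumes each group has exactly one non-NaN value:

for group_id, idx in df.groupby('test_group').groups.items():
    # grab the group's single non-NaN test_type and write it to every row of the group
    non_nan = df.loc[idx, 'test_type'].dropna()
    if not non_nan.empty:
        df.loc[idx, 'test_type'] = non_nan.iloc[0]

This works on the small example but crawls on millions of rows.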
Upvotes: 2
Views: 730
Reputation: 78650
Under the assumption that there's a unique non-nan value per group, the following should satisfy your request.
>>> df['test_type'] = df.groupby('test_group')['test_type'].ffill().bfill()
>>> df
test_group test_type
0 1 memory
1 1 memory
2 1 memory
3 2 visual
4 2 visual
5 2 visual
6 3 auditory
7 3 auditory
8 3 auditory
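For intuition, the group-wise ffill on its own produces something like

>>> df.groupby('test_group')['test_type'].ffill()
0         NaN
1      memory
2      memory
3         NaN
4      visual
5      visual
6         NaN
7    auditory
8    auditory
Name: test_type, dtype: object

and the trailing bfill then fills the leading NaN of each group, which is safe here because every group contains a non-nan value.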
edit:
The original answer used
df.groupby('test_group')['test_type'].fillna(method='ffill').fillna(method='bfill')
but it looks like, according to schwim's timings, ffill/bfill is significantly faster (for some reason).
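If you'd rather keep everything group-wise in a single step, a transform with 'first' should give the same result under the same unique-non-nan-per-group assumption (a sketch, not benchmarked here):

>>> df['test_type'] = df.groupby('test_group')['test_type'].transform('first')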
Upvotes: 1
Reputation: 20669
You can use GroupBy.size to get the size of each group. Then boolean index using Series.isna. Now, use Index.repeat with df.reindex:
repeats = df.groupby('test_group').size()
out = df[~df['test_type'].isna()]
out.reindex(out.index.repeat(repeats)).reset_index(drop=True)
test_group test_type
0 1 memory
1 1 memory
2 1 memory
3 2 visual
4 2 visual
5 2 visual
6 3 auditory
7 3 auditory
8 3 auditory
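Note this rebuilds the frame rather than filling in place, so it assumes each group contains exactly one non-nan row and that the groups appear as contiguous, sorted blocks (as in the example). If you need to write the values back into df without relying on that ordering, mapping on the group key is one option (a sketch under the same one-non-nan-per-group assumption):

# build a group -> value lookup from the non-NaN rows, then map it onto the group column
lookup = df.loc[df['test_type'].notna()].set_index('test_group')['test_type']
df['test_type'] = df['test_group'].map(lookup)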
Benchmarking dataframe:
df = pd.DataFrame({'test_group': [1]*10_001 + [2]*10_001 + [3]*10_001,
'test_type' : [np.nan]*10_000 + ['memory'] +
[np.nan]*10_000 + ['visual'] +
[np.nan]*10_000 + ['auditory']})
df.shape
# (30003, 2)
Results:
# Ch3steR's answer
In [54]: %%timeit
...: repeats = df.groupby('test_group').size()
...: out = df[~df['test_type'].isna()]
...: out.reindex(out.index.repeat(repeats)).reset_index(drop=True)
...:
...:
2.56 ms ± 73.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# timgeb's answer
In [55]: %%timeit
...: df['test_type'] = df.groupby('test_group')['test_type'].fillna(method='ffill').fillna(method='bfill')
...:
...:
10.1 ms ± 724 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Almost 4X faster. I believe it's because boolean indexing is very fast, and reindex + repeat is lightweight compared to the dual fillna.
Upvotes: 2