Reputation: 2570
Say we have a dataframe set up as follows:
import numpy as np
import pandas as pd

x = pd.DataFrame(np.random.randint(1, 10, 30).reshape(5, 6),
                 columns=[f'col{i}' for i in range(6)])
x['col6'] = np.nan
x['col7'] = np.nan
   col0  col1  col2  col3  col4  col5  col6  col7
0     6     5     1     5     2     4   NaN   NaN
1     8     8     9     6     7     2   NaN   NaN
2     8     3     9     6     6     6   NaN   NaN
3     8     4     4     4     8     9   NaN   NaN
4     5     3     4     3     8     7   NaN   NaN
When calling x.shift(2, axis=1), col2 -> col5 shift correctly, but col6 and col7 stay as NaN.
How can I overwrite the NaN values in col6 and col7 with the values from col4 and col5? Is this a bug or intended?
   col0  col1  col2  col3  col4  col5  col6  col7
0   NaN   NaN   6.0   5.0   1.0   5.0   NaN   NaN
1   NaN   NaN   8.0   8.0   9.0   6.0   NaN   NaN
2   NaN   NaN   8.0   3.0   9.0   6.0   NaN   NaN
3   NaN   NaN   8.0   4.0   4.0   4.0   NaN   NaN
4   NaN   NaN   5.0   3.0   4.0   3.0   NaN   NaN
Upvotes: 6
Views: 3423
Reputation: 394129
It's possible this is a bug. You can use np.roll to achieve this:
In[11]:
x.apply(lambda x: np.roll(x, 2), axis=1)
Out[11]:
   col0  col1  col2  col3  col4  col5  col6  col7
0   NaN   NaN   6.0   5.0   1.0   5.0   2.0   4.0
1   NaN   NaN   8.0   8.0   9.0   6.0   7.0   2.0
2   NaN   NaN   8.0   3.0   9.0   6.0   6.0   6.0
3   NaN   NaN   8.0   4.0   4.0   4.0   8.0   9.0
4   NaN   NaN   5.0   3.0   4.0   3.0   8.0   7.0
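On newer pandas releases, apply with axis=1 and a function that returns a NumPy array may give back a Series of arrays rather than a DataFrame. A minimal sketch, assuming that behaviour: passing result_type='broadcast' asks apply to keep the original index and columns. The np.arange data and the name rolled are illustrative stand-ins for the random frame in the question.

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the question's frame:
# six integer columns followed by two all-NaN columns.
x = pd.DataFrame(np.arange(1, 31).reshape(5, 6),
                 columns=[f'col{i}' for i in range(6)])
x['col6'] = np.nan
x['col7'] = np.nan

# Roll every row two positions to the right; 'broadcast' keeps the
# result in the shape (index and columns) of the original frame.
rolled = x.apply(lambda row: np.roll(row, 2), axis=1, result_type='broadcast')
print(rolled)
```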
Speed-wise, it's probably quicker to construct a new DataFrame: pass the result of np.roll as the data argument to the DataFrame constructor and reuse the existing columns:
In[12]:
x = pd.DataFrame(np.roll(x, 2, axis=1), columns = x.columns)
x
Out[12]:
   col0  col1  col2  col3  col4  col5  col6  col7
0   NaN   NaN   6.0   5.0   1.0   5.0   2.0   4.0
1   NaN   NaN   8.0   8.0   9.0   6.0   7.0   2.0
2   NaN   NaN   8.0   3.0   9.0   6.0   6.0   6.0
3   NaN   NaN   8.0   4.0   4.0   4.0   8.0   9.0
4   NaN   NaN   5.0   3.0   4.0   3.0   8.0   7.0
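Note that np.roll wraps around: the last two columns reappear at the front. Here those columns are all NaN, so the result happens to look like a shift. A rough sketch of a more general version, assuming a non-negative shift and numeric data, blanks the wrapped-around columns explicitly. The helper name shift_columns and the small frame df are hypothetical, for illustration only.

```python
import numpy as np
import pandas as pd

def shift_columns(df, periods):
    """Hypothetical helper: move values `periods` columns to the right.

    Assumes periods >= 0 and numeric data; the vacated left-hand
    columns are filled with NaN, as DataFrame.shift would do.
    """
    values = np.roll(df.to_numpy(dtype=float), periods, axis=1)
    values[:, :periods] = np.nan  # discard the values np.roll wrapped around
    return pd.DataFrame(values, index=df.index, columns=df.columns)

df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], columns=list('abcd'))
shifted = shift_columns(df, 2)
print(shifted)
```

Without the NaN assignment, columns a and b would hold the wrapped values 3, 4 and 7, 8 instead.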
Timings:
In[13]:
%timeit pd.DataFrame(np.roll(x, 2, axis=1), columns = x.columns)
%timeit x.fillna(0).astype(int).shift(2, axis=1)
10000 loops, best of 3: 117 µs per loop
1000 loops, best of 3: 418 µs per loop
So constructing a new df from the result of np.roll (117 µs) is roughly 3.5 times quicker than first filling the NaN values, casting to int, and then shifting (418 µs).
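The shift in the timing comparison is only applied after everything has been cast to one dtype, which suggests the stuck columns come from mixing integer columns with float NaN columns. A hedged sketch under that assumption: casting to float instead of int gives a single dtype without having to replace NaN with 0. The np.arange data is an illustrative stand-in for the random frame.

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the question's frame.
x = pd.DataFrame(np.arange(1, 31).reshape(5, 6),
                 columns=[f'col{i}' for i in range(6)])
x['col6'] = np.nan
x['col7'] = np.nan

# Make every column float64 first, then shift along the columns.
shifted = x.astype(float).shift(2, axis=1)
print(shifted)
```

With a single dtype, col6 and col7 pick up the values of col4 and col5, and only col0 and col1 are left as NaN.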
Upvotes: 4