Reputation: 151
The DataFrame has two columns A
and B
of integers.
a b
1 3
4 2
2 0
6 1
...
I need to swap in the following way:
if df.a > df.b:
temp = df.b
df.b = df.a
df.a = temp
expected output:
a b
1 3
2 4 <----
0 2 <----
1 6 <----
Basically always having in column A the smaller value of the twos.
I feel I should use loc
but I couldn't find the right way yet.
Upvotes: 4
Views: 5984
Reputation: 294258
Solution
It's as simple as
df.values.sort(1)
df
a b
0 1 3
1 2 4
2 0 2
3 1 6
What Happened
I can sort a numpy.array
in place with its sort
method. I pass the parameter axis=1
to indicate that I want to sort along the first axis (row wise). The values
attribute of a dataframe accesses the underlying numpy
array. So df.values.sort(1)
sorts the underlying values in place row wise... done.
We can be a bit more explicit with
df.values[:] = np.sort(df.values, 1)
This allows us a lot of flexibility to perform this over subsets of columns or reverse sort
df.values[:, ::-1] = np.sort(df.values, 1)
Upvotes: 3
Reputation: 210842
In [443]: df['a'], df['b'] = df.min(axis=1), df.max(axis=1)
In [444]: df
Out[444]:
a b
0 1 3
1 2 4
2 0 2
3 1 6
or
pd.DataFrame(np.sort(d.values, axis=1), d.index, d.columns)
Upvotes: 6
Reputation: 7967
Seeing the methods proposed by @JohnGait and @MaxU, I did a small speed comparison.
arr = np.random.randint(low = 100, size = (10000000, 2))
# using np.where
df = pd.DataFrame(arr, columns = ['a', 'b'])
t_0 = time.time()
df.a, df.b = np.where(df.a > df.b, [df.b, df.a], [df.a, df.b])
t_1 = time.time()
# using df.loc
df = pd.DataFrame(arr, columns = ['a', 'b'])
t_2 = time.time()
cond = df.a > df.b
df.loc[cond, ['a', 'b']] = df.loc[cond, ['b', 'a']].values
t_3 = time.time()
# using df.min
df = pd.DataFrame(arr, columns = ['a', 'b'])
t_4 = time.time()
df['a'], df['b'] = df.min(axis=1), df.max(axis=1)
t_5 = time.time()
# using np.sort
t_6 = time.time()
df_ = pd.DataFrame(np.sort(arr, axis=1), df.index, df.columns)
t_7 = time.time()
t_1 - t_0 # using np.where: 5.759037971496582
t_3 - t_2 # using .loc: 0.12156987190246582
t_5 - t_4 # using df.min: 1.0503261089324951
t_7 - t_6 # 0.20351791381835938
Although second approach is the fastest approach, the practical gain is insignificant. I am adding it here for pedantic reasons. I didn't include the sort method as I am convinced that's going be a lot slower.
EDIT
I had wrongly reported the computation time of np.where
due to a mistake I made. Corrected that (turns out its the slowest of the lot!) and added another method (following @MaxU's comment)
Upvotes: 3
Reputation: 76917
Using np.where
you can do
In [21]: df.a, df.b = np.where(df.a > df.b, [df.b, df.a], [df.a, df.b])
In [23]: df
Out[23]:
a b
0 1 3
1 2 4
2 0 2
3 1 6
Or, using .loc
In [35]: cond = df.a > df.b
In [36]: df.loc[cond, ['a', 'b']] = df.loc[cond, ['b', 'a']].values
In [37]: df
Out[37]:
a b
0 1 3
1 2 4
2 0 2
3 1 6
Or, .apply(np.sort, axis=1)
if you need smaller a
values and larger b
In [54]: df.apply(np.sort, axis=1)
Out[54]:
a b
0 1 3
1 2 4
2 0 2
3 1 6
Upvotes: 6