Reputation: 14651
I have the following pandas dataframe:
column_01  column_02  value
ccc        aaa        1
bbb        ddd        34
ddd        aaa        98
I need to re-organise the dataframe so that column_01 contains whichever value comes first alphabetically between column_01 and column_02. The output for the above example would be:
column_01  column_02  value
aaa        ccc        1
bbb        ddd        34
aaa        ddd        98
I could obviously do this by iterating over the dataframe one row at a time, comparing column_01 to column_02 to see which comes first alphabetically and swapping them if necessary. The problem is that the dataframe is quite big (over 1 million rows), so this isn't an efficient approach.
Is there a way to do this without iterating over every row individually?
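For reference, the row-by-row version I have in mind is roughly the sketch below (the dataframe construction just recreates the sample data above):
import pandas as pd

# recreate the example dataframe
df = pd.DataFrame({'column_01': ['ccc', 'bbb', 'ddd'],
                   'column_02': ['aaa', 'ddd', 'aaa'],
                   'value': [1, 34, 98]})

# naive per-row swap: put the alphabetically smaller value into column_01;
# the Python-level loop is what makes this slow on a million rows
for i, row in df.iterrows():
    if row['column_01'] > row['column_02']:
        df.at[i, 'column_01'] = row['column_02']
        df.at[i, 'column_02'] = row['column_01']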
Upvotes: 3
Views: 1353
Reputation: 863166
You can use:
df[['column_01','column_02']] = df[['column_01','column_02']].apply(lambda x: sorted(x.values), axis=1)
print (df)
  column_01 column_02  value
0       aaa       ccc      1
1       bbb       ddd     34
2       aaa       ddd     98
Other solutions:
import numpy as np

df[['column_01','column_02']] = pd.DataFrame(np.sort(df[['column_01','column_02']].values),
                                             index=df.index, columns=['column_01','column_02'])
Or only with a numpy array:
df[['column_01','column_02']] = np.sort(df[['column_01','column_02']].values)
print (df)
  column_01 column_02  value
0       aaa       ccc      1
1       bbb       ddd     34
2       aaa       ddd     98
The numpy-based solutions are much faster, because apply uses Python-level loops:
# scale the sample dataframe up for timing
df = pd.concat([df]*1000).reset_index(drop=True)
In [177]: %timeit df[['column_01','column_02']] = pd.DataFrame(np.sort(df[['column_01','column_02']].values), index=df.index, columns=['column_01','column_02'])
1000 loops, best of 3: 1.36 ms per loop
In [182]: %timeit df[['column_01','column_02']] = np.sort(df[['column_01','column_02']].values)
1000 loops, best of 3: 1.54 ms per loop
In [178]: %timeit df[['column_01','column_02']] = (df[['column_01','column_02']].apply(lambda x: sorted(x.values), axis=1))
1 loop, best of 3: 291 ms per loop
Upvotes: 2