StefanK
StefanK

Reputation: 2180

How to create sorted list of values from mutiple columns in pandas?

I have a data frame with column A and column B that can have same value pairs when sorted. I would like to deduplicate these columns as I don't care about the order in my application.

Here is a sample dataframe:

import pandas as pd
df = pd.DataFrame({'col1':[1, 2, 3], 'col2':[2, 1, 4]})
print(df)

This is how the dataframe looks like:

index col1  col2 

0     1     2 

1     2     1 

2     3     4

What I want to achieve is create a new column that will have sorted list of first two values for every row so I will be able to deduplicate the dataframe based on this column.

The key_column would look like this:

0   [1, 2]

1   [1, 2]

2   [3, 4]

I would then use df.drop_duplicates(col3)

I have an idea that I should either use .apply or .map and maybe some lambda function, but nothing I tried worked so far:

df.apply(lambda row: sorted([row[0], row[1]]), axis=1) # this sorts the column values in place but doesn't create a new column with a list
sorted([df['col1'], df['col2']]) # returns error The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
df.map(sorted) # dataframe object has no attribute map
df[['col1', 'col2']].apply(lambda x:
    sorted([','.join(x.astype(int).astype(str))]), axis=1) # creates a list but is not sorted

Thank you for help, I would like to see a solution that is also explained - why it works.

Upvotes: 0

Views: 1219

Answers (2)

Aleksandr Borisov
Aleksandr Borisov

Reputation: 2212

Three steps:

df['x'] = df.apply(lambda x: tuple(sorted(x)), axis=1)
df = df.drop_duplicates('x')
del df['x']

Upvotes: 2

cs95
cs95

Reputation: 402814

Option 1

Use df.apply and pass sorted:

In [1234]: df['col3'] = df.apply(tuple, 1).apply(sorted).apply(tuple)

In [1235]: df.drop_duplicates('col3')
Out[1235]: 
   col1  col2    col3
0     1     2  (1, 2)
2     3     4  (3, 4)

Option 2

Call np.sort on df.values and then assign the result to a new column.

In [1208]: df['col3'] = pd.Series([tuple(x) for x in np.sort(df.values, 1)]); df
Out[1208]: 
   col1  col2    col3
0     1     2  (1, 2)
1     2     1  (1, 2)
2     3     4  (3, 4)

In [1210]: df.drop_duplicates('col3')
Out[1210]: 
   col1  col2    col3
0     1     2  (1, 2)
2     3     4  (3, 4)

Upvotes: 5

Related Questions