Reputation: 399
I have a dataframe with two columns containing lists. I want to combine these columns into a single column and merge the lists into a single list. Also this list should only contain unique values from the original lists.
I've tried merging them using df['E']=df[['B','C']].values.tolist()
.
However this creates a single column with values comprising two lists.
The dataframe looks something like this:
A B C D
a1 [b1,b2] [c1,b1] d1
a2 [b1,b1] [b3] d2
a3 [b2] [b2,b2] d3
The final dataframe should look like this:
A B C D E
a1 [b1,b2] [c1,b1] d1 [b1,b2,c1]
a2 [b1,b1] [b3] d2 [b1,b3]
a3 [b2] [b2,b2] d3 [b2]
Edit: The values within the lists of the dataframe are strings.
Upvotes: 6
Views: 5069
Reputation: 323226
IIUC
df['E']=(df.B+df.C).map(set).map(list)
df
Out[81]:
A B C D E
0 a1 [b1, b2] [c1, b1] d1 [b2, b1, c1]
1 a2 [b1, b1] [b3] d2 [b3, b1]
2 a3 [b2] [b2, b2] d3 [b2]
Upvotes: 9
Reputation: 164623
You can use itertools.chain
with dict.fromkeys
within a list comprehension. Note by your choice of lists within series you are losing all vectorisation benefits.
from itertools import chain
df = pd.DataFrame({'A': ['a1', 'a2', 'a3'],
'B': [['b1', 'b2'], ['b1', 'b1'], ['b2']],
'C': [['c1', 'b1'], ['b3'], ['b2', 'b2']],
'D': ['d1', 'd2', 'd3']})
df['E'] = [list(dict.fromkeys(chain(x, y))) for x, y in zip(df['B'], df['C'])]
print(df)
A B C D E
0 a1 [b1, b2] [c1, b1] d1 [b1, b2, c1]
1 a2 [b1, b1] [b3] d2 [b1, b3]
2 a3 [b2] [b2, b2] d3 [b2]
The benefit of this method in Python v3.7+ (and unofficially in v3.6 as a CPython implementation detail) is ordering is preserved, as dictionaries are insertion-ordered.
Upvotes: 3
Reputation: 61910
If order does not matter, set will do the job:
import pandas as pd
data = [['a1', ['b1', 'b2'], ['c1', 'b1'], 'd1'],
['a2', ['b1', 'b1'], ['b3'], 'd2'],
['a3', ['b2'], ['b2', 'b2'], 'd3']]
df = pd.DataFrame(data=data, columns=['A', 'B', 'C', 'D'])
def uniques(xs):
return list(set(xi for x in xs for xi in x))
df['E'] = df[['B', 'C']].apply(uniques, axis=1)
print(df)
Output
A B C D E
0 a1 [b1, b2] [c1, b1] d1 [b1, b2, c1]
1 a2 [b1, b1] [b3] d2 [b1, b3]
2 a3 [b2] [b2, b2] d3 [b2]
If order does matter use OrderedDict:
import pandas as pd
from collections import OrderedDict
data = [['a1', ['b1', 'b2'], ['c1', 'b1'], 'd1'],
['a2', ['b1', 'b1'], ['b3'], 'd2'],
['a3', ['b2'], ['b2', 'b2'], 'd3']]
df = pd.DataFrame(data=data, columns=['A', 'B', 'C', 'D'])
def uniques(xs):
return list(OrderedDict().fromkeys(xi for x in xs for xi in x))
df['E'] = df[['B', 'C']].apply(uniques, axis=1)
Output
A B C D E
0 a1 [b1, b2] [c1, b1] d1 [b1, b2, c1]
1 a2 [b1, b1] [b3] d2 [b1, b3]
2 a3 [b2] [b2, b2] d3 [b2]
Upvotes: 2