Reputation: 820
I have a dataframe like so:
import pandas as pd

data = {'ID': ['nan, -1', '647, 47', '603, 603', '6036299, 6036299']}
df = pd.DataFrame(data)
df
                 ID
0           nan, -1
1           647, 47
2          603, 603
3  6036299, 6036299
How can I create a new column that shows only the unique values in column ID?
Output:
ID                  unique
nan, -1             nan, -1
647, 47             647, 47
603, 603            603
6036299, 6036299    6036299
I have tried df['unique'] = df.ID.unique()
and df['unique'] = [', '.join(set(x.split(', '))) for x in df['ID']],
but neither works.
Upvotes: 1
Views: 143
Reputation: 28699
This is another option, albeit more verbose, and the result is not ordered:
df['unique'] = (df.ID
                  .str.strip()
                  .str.split(', ')
                  .apply(set)
                  .apply(lambda x: ', '.join(x)))
                 ID   unique
0           nan, -1  -1, nan
1           647, 47  47, 647
2          603, 603      603
3  6036299, 6036299  6036299
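If you want this approach but with order of first appearance preserved, one sketch is to swap set for pd.unique, which keeps first-seen order (the data below is just the sample from the question):

```python
import pandas as pd

data = {'ID': ['nan, -1', '647, 47', '603, 603', '6036299, 6036299']}
df = pd.DataFrame(data)

# pd.unique preserves the order of first appearance, unlike set
df['unique'] = [', '.join(pd.unique(x.split(', '))) for x in df['ID']]
print(df)
```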
Upvotes: 1
Reputation: 862911
If order is not important, your second solution works fine:
df['unique'] = [', '.join(set(x.split(', '))) for x in df['ID']]
print (df)
                 ID   unique
0           nan, -1  -1, nan
1           647, 47  647, 47
2          603, 603      603
3  6036299, 6036299  6036299
If order is important, use dict.fromkeys to remove the duplicates:
df['unique'] = [', '.join(dict.fromkeys(x.split(', ')).keys()) for x in df['ID']]
print (df)
                 ID   unique
0           nan, -1  nan, -1
1           647, 47  647, 47
2          603, 603      603
3  6036299, 6036299  6036299
If you want to remove duplicates across all values, it is more complicated - split the values, reshape with stack, remove the duplicates, and join the groups back:
data = {'ID':['nan, -1', '647, 47', '603, 603', '6036299, 6036299, 47']}
df = pd.DataFrame(data)
df['unique11'] = [', '.join(set(x.split(', '))) for x in df['ID']]
df['unique12'] = [', '.join(dict.fromkeys(x.split(', ')).keys()) for x in df['ID']]
df['unique2'] = (df['ID'].str.split(', ', expand=True)
.stack()
.drop_duplicates()
.groupby(level=0)
.agg(', '.join))
print (df)
                     ID     unique11     unique12  unique2
0               nan, -1      -1, nan      nan, -1  nan, -1
1               647, 47      647, 47      647, 47  647, 47
2              603, 603          603          603      603
3  6036299, 6036299, 47  47, 6036299  6036299, 47  6036299
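To see what each stage of that chain produces, here is the same pipeline unrolled step by step (just a sketch of the intermediate objects, using the modified sample data above):

```python
import pandas as pd

df = pd.DataFrame({'ID': ['nan, -1', '647, 47', '603, 603', '6036299, 6036299, 47']})

# Step 1: split into one column per token (shorter rows are padded with None)
wide = df['ID'].str.split(', ', expand=True)

# Step 2: stack into a single Series with a (row, position) MultiIndex;
# the None padding is dropped automatically
long = wide.stack()

# Step 3: drop duplicates across ALL rows (the first occurrence is kept,
# so '47' in row 3 disappears because row 1 already contains it)
deduped = long.drop_duplicates()

# Step 4: rejoin the surviving tokens per original row (level=0 of the MultiIndex)
result = deduped.groupby(level=0).agg(', '.join)
print(result)
```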
Upvotes: 3