Mazz

Reputation: 820

Get unique strings from multiple columns in Pandas Dataframe

I have a dataframe like so:

import pandas as pd

data = {'ID': ['nan, -1', '647, 47', '603, 603', '6036299, 6036299']}

df = pd.DataFrame(data)
df


ID
nan, -1
647, 47
603, 603
6036299, 6036299

How can I create a new column that shows only the unique values in column ID?

Output:

 ID                       unique
nan, -1                    nan, -1
647, 47                    647, 47
603, 603                   603
6036299, 6036299           6036299

I have tried df['unique'] = df.ID.unique() and df['unique'] = [', '.join(set(x.split(', '))) for x in df['ID']], but they don't work.

Upvotes: 1

Views: 143

Answers (2)

sammywemmy

Reputation: 28699

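This is another option. It is more verbose, and it does not preserve the original order:

# The parentheses are required so the method chain can span several lines
df['unique'] = (df.ID
                  .str.strip()
                  .str.split(', ')
                  .apply(set)
                  .apply(lambda x: ', '.join(x)))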

       ID                unique
0   nan, -1              -1, nan
1   647, 47              47, 647
2   603, 603             603
3   6036299, 6036299    6036299

Upvotes: 1

jezrael

Reputation: 862911

If order is not important, your second solution works fine:

df['unique'] = [', '.join(set(x.split(', '))) for x in df['ID']]
print (df)
                 ID   unique
0           nan, -1  -1, nan
1           647, 47  647, 47
2          603, 603      603
3  6036299, 6036299  6036299

If order is important, use dict.fromkeys to remove duplicates:

df['unique'] = [', '.join(dict.fromkeys(x.split(', ')).keys()) for x in df['ID']]
print (df)
                 ID   unique
0           nan, -1  nan, -1
1           647, 47  647, 47
2          603, 603      603
3  6036299, 6036299  6036299
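
The same idea can be wrapped in a small helper, which may be easier to read and reuse. This is only a sketch: the name unique_in_order and its sep parameter are made up for illustration, and it relies on dicts keeping insertion order (guaranteed from Python 3.7).

```python
import pandas as pd


def unique_in_order(cell, sep=', '):
    """Drop repeated tokens in one cell, keeping the first occurrence of each."""
    # dict.fromkeys keeps keys in insertion order, so the original order survives
    return sep.join(dict.fromkeys(cell.split(sep)))


df = pd.DataFrame({'ID': ['nan, -1', '647, 47', '603, 603', '6036299, 6036299']})

# Apply the helper to every string in the column
df['unique'] = df['ID'].map(unique_in_order)
print(df)
```

A set cannot give this guarantee, because its iteration order is unrelated to the order in which the values were added.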

If you want to remove duplicates across all values in the column, it is more complicated: split the values, reshape with stack, remove duplicates, and join the groups back:

data = {'ID':['nan, -1', '647, 47', '603, 603', '6036299, 6036299, 47']} 

df = pd.DataFrame(data)

df['unique11'] = [', '.join(set(x.split(', '))) for x in df['ID']]
df['unique12'] = [', '.join(dict.fromkeys(x.split(', ')).keys()) for x in df['ID']]
df['unique2'] = (df['ID'].str.split(', ', expand=True)
                        .stack()
                        .drop_duplicates()
                        .groupby(level=0)
                        .agg(', '.join))
print (df)

                     ID     unique11     unique12  unique2
0               nan, -1      -1, nan      nan, -1  nan, -1
1               647, 47      647, 47      647, 47  647, 47
2              603, 603          603          603      603
3  6036299, 6036299, 47  47, 6036299  6036299, 47  6036299
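
A variant of the column-wide approach, broken into named steps, might look like the sketch below. It swaps split(expand=True).stack() for explode(), which avoids the padding values that expand=True creates for shorter rows. The reindex step is an assumption about what you want for a row whose values have all appeared earlier: such a row has no group left, so it would otherwise end up as NaN after assignment.

```python
import pandas as pd

df = pd.DataFrame({'ID': ['nan, -1', '647, 47', '603, 603', '6036299, 6036299, 47']})

# One token per row; the index repeats the original row label for each token
tokens = df['ID'].str.split(', ').explode()

# Keep only the first occurrence of each token across the whole column
first_seen = tokens.drop_duplicates()

# Join the surviving tokens back per original row
joined = first_seen.groupby(level=0).agg(', '.join)

# Rows that lost every token are missing from `joined`; fill them with ''
df['unique2'] = joined.reindex(df.index, fill_value='')
print(df)
```

With the sample data, the trailing 47 in the last row is dropped because 47 already appears in row 1.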

Upvotes: 3
