J. Dykstra
J. Dykstra

Reputation: 211

Get unique strings in pandas column by delimiter

Lets say i have the data below:

import numpy as np
import pandas as pd

data=np.array([["xxx--xxx--xxx--yyy"],
              ["aaa--bbb--aaa--ccc"],
              ["xxx--axa--axa--ccc"],
              ["bbb--bab--bbb--bab--tgh"]])

df = pd.DataFrame({'Practice Column': data.ravel()})

print(df)

How could i create a new column in this dataframe that will look at the strings and spit out a unique combination? The desired output would be:

enter image description here

Any help is appreciated. Thanks.

Upvotes: 3

Views: 603

Answers (3)

Harshitha C
Harshitha C

Reputation: 75

Hope this works

df = pd.DataFrame({'Practice Column': data.ravel(),'Desired':data.unique()})

Upvotes: 1

cs95
cs95

Reputation: 402723

Consider using OrderedDict here to drop duplicates and keep order very efficiently.

from collections import OrderedDict as o

df['Desired'] = [
    '--'.join(o.fromkeys(x.split('--'), 1)) 
    for x in df['Practice Column']]
df

           Practice Column        Desired
0       xxx--xxx--xxx--yyy       xxx--yyy
1       aaa--bbb--aaa--ccc  aaa--bbb--ccc
2       xxx--axa--axa--ccc  xxx--axa--ccc
3  bbb--bab--bbb--bab--tgh  bbb--bab--tgh

Performance

df_ = df
df = pd.concat([df] * 1000, ignore_index=True)

%%timeit
df['des'] = [
     '--'.join(sorted(set(x.split('--')),key=x.index)) 
     for x in df['Practice Column']]

%%timeit
df['des'] = [
    '--'.join(o.fromkeys(x.split('--'), 1)) 
    for x in df['Practice Column']
]

14.6 ms ± 392 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
9.18 ms ± 265 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Haven't timed jez's second solution as it does not maintain order.

Upvotes: 2

jezrael
jezrael

Reputation: 863031

Use list comprehension with split, pandas.unique for unique with same ordering or set with sorted and last join together:

df['des'] = ['--'.join(pd.unique(x.split('--'))) for x in df['Practice Column']]

Or:

df['des'] = ['--'.join(sorted(set(x.split('--')),key=x.index)) for x in df['Practice Column']]

print (df)
           Practice Column            des
0       xxx--xxx--xxx--yyy       xxx--yyy
1       aaa--bbb--aaa--ccc  aaa--bbb--ccc
2       xxx--axa--axa--ccc  xxx--axa--ccc
3  bbb--bab--bbb--bab--tgh  bbb--bab--tgh

If ordering is not important solution is simplier:

df['des'] = ['--'.join(set(x.split('--'))) for x in df['Practice Column']]
print (df)
           Practice Column            des
0       xxx--xxx--xxx--yyy       yyy--xxx
1       aaa--bbb--aaa--ccc  ccc--bbb--aaa
2       xxx--axa--axa--ccc  ccc--axa--xxx
3  bbb--bab--bbb--bab--tgh  bab--tgh--bbb

Upvotes: 5

Related Questions