Reputation: 211
Lets say i have the data below:
import numpy as np
import pandas as pd
data=np.array([["xxx--xxx--xxx--yyy"],
["aaa--bbb--aaa--ccc"],
["xxx--axa--axa--ccc"],
["bbb--bab--bbb--bab--tgh"]])
df = pd.DataFrame({'Practice Column': data.ravel()})
print(df)
How could i create a new column in this dataframe that will look at the strings and spit out a unique combination? The desired output would be:
Any help is appreciated. Thanks.
Upvotes: 3
Views: 603
Reputation: 75
Hope this works
df = pd.DataFrame({'Practice Column': data.ravel(),'Desired':data.unique()})
Upvotes: 1
Reputation: 402723
Consider using OrderedDict
here to drop duplicates and keep order very efficiently.
from collections import OrderedDict as o
df['Desired'] = [
'--'.join(o.fromkeys(x.split('--'), 1))
for x in df['Practice Column']]
df
Practice Column Desired
0 xxx--xxx--xxx--yyy xxx--yyy
1 aaa--bbb--aaa--ccc aaa--bbb--ccc
2 xxx--axa--axa--ccc xxx--axa--ccc
3 bbb--bab--bbb--bab--tgh bbb--bab--tgh
Performance
df_ = df
df = pd.concat([df] * 1000, ignore_index=True)
%%timeit
df['des'] = [
'--'.join(sorted(set(x.split('--')),key=x.index))
for x in df['Practice Column']]
%%timeit
df['des'] = [
'--'.join(o.fromkeys(x.split('--'), 1))
for x in df['Practice Column']
]
14.6 ms ± 392 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
9.18 ms ± 265 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Haven't timed jez's second solution as it does not maintain order.
Upvotes: 2
Reputation: 863031
Use list comprehension with split
, pandas.unique
for unique with same ordering or set
with sorted
and last join
together:
df['des'] = ['--'.join(pd.unique(x.split('--'))) for x in df['Practice Column']]
Or:
df['des'] = ['--'.join(sorted(set(x.split('--')),key=x.index)) for x in df['Practice Column']]
print (df)
Practice Column des
0 xxx--xxx--xxx--yyy xxx--yyy
1 aaa--bbb--aaa--ccc aaa--bbb--ccc
2 xxx--axa--axa--ccc xxx--axa--ccc
3 bbb--bab--bbb--bab--tgh bbb--bab--tgh
If ordering is not important solution is simplier:
df['des'] = ['--'.join(set(x.split('--'))) for x in df['Practice Column']]
print (df)
Practice Column des
0 xxx--xxx--xxx--yyy yyy--xxx
1 aaa--bbb--aaa--ccc ccc--bbb--aaa
2 xxx--axa--axa--ccc ccc--axa--xxx
3 bbb--bab--bbb--bab--tgh bab--tgh--bbb
Upvotes: 5