Reputation: 145
I need some help removing duplicated pairs from a pandas DataFrame.
I have:
>>> df1
source target metric
0 acc1.yyy acx1.xxx 10000
1 acx1.xxx acc1.yyy 10000
The goal is to remove duplicates based on source+target or target+source, i.e. treat a reversed pair as the same pair. But I can't get this with drop_duplicates:
>>> df2 = df1.drop_duplicates(subset=['source','target'])
>>> df2
source target metric
0 acc1.yyy acx1.xxx 10000
1 acx1.xxx acc1.yyy 10000
[updated]
Maybe "duplicate" is not the correct word here, so let me explain further:
id source target
0 bng1.xxx.00 bdr2.xxx.00
1 bng1.xxx.00 bdr1.xxx.00
2 bdr3.yyy.00 bdr3.xxx.00
3 bdr3.xxx.00 bdr3.yyy.00
4 bdr2.xxx.00 bng1.xxx.00
5 bdr1.xxx.00 bng1.xxx.00
With the above, I want to remove the entries where one row's source equals another row's target and vice versa:
0 and 4 = same pair
1 and 5 = same pair
2 and 3 = same pair
The end goal is to keep rows 0, 1, 2 (or 4, 5, 3).
Upvotes: 3
Views: 1298
Reputation: 862511
You need to sort both columns first:
df1[['source','target']] = df1[['source','target']].apply(sorted, axis=1, result_type='expand')
print (df1)
source target metric
0 acc1.yyy acx1.xxx 10000
1 acc1.yyy acx1.xxx 10000
df2 = df1.drop_duplicates(subset=['source','target'])
print (df2)
source target metric
0 acc1.yyy acx1.xxx 10000
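If `apply(sorted, axis=1)` is slow on a large frame, the same row-wise sort can be done vectorized with `np.sort`; a minimal sketch using the sample data from the question:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'source': ['acc1.yyy', 'acx1.xxx'],
    'target': ['acx1.xxx', 'acc1.yyy'],
    'metric': [10000, 10000],
})

# Sort each (source, target) pair alphabetically in one vectorized call,
# so a reversed pair becomes identical to the original pair.
df1[['source', 'target']] = np.sort(df1[['source', 'target']].values, axis=1)

df2 = df1.drop_duplicates(subset=['source', 'target'])
print(df2)
```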
EDIT:
It seems the source column needs to be adjusted first - remove the last 3 characters:
df1['source1'] = df1.source.str[:-3]
df1[['source1','target']] = df1[['source1','target']].apply(sorted, axis=1, result_type='expand')
print (df1)
id source target source1
0 0 bng1.xxx.00-00 bng1.xxx.00 bdr2.xxx.00
1 1 bng1.xxx.00-00 bng1.xxx.00 bdr1.xxx.00
2 2 bdr3.yyy.00-00 bdr3.yyy.00 bdr3.xxx.00
3 3 bdr3.xxx.00-00 bdr3.yyy.00 bdr3.xxx.00
4 4 bdr2.xxx.00-00 bng1.xxx.00 bdr2.xxx.00
5 5 bdr1.xxx.00-00 bng1.xxx.00 bdr1.xxx.00
df2 = df1.drop_duplicates(subset=['source1','target'])
df2 = df2.drop('source1', axis=1)
print (df2)
id source target
0 0 bng1.xxx.00-00 bng1.xxx.00
1 1 bng1.xxx.00-00 bng1.xxx.00
2 2 bdr3.yyy.00-00 bdr3.yyy.00
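If the trailing "-00" is always a literal suffix, stripping it by name rather than by position is a bit more robust than `str[:-3]`; a sketch assuming pandas >= 1.4 (for `Series.str.removesuffix`) and a two-row subset of the sample data:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'id': [0, 4],
    'source': ['bng1.xxx.00-00', 'bdr2.xxx.00-00'],
    'target': ['bdr2.xxx.00', 'bng1.xxx.00'],
})

# Strip the literal "-00" suffix instead of slicing a fixed width,
# so names of other lengths are handled too.
df1['source1'] = df1['source'].str.removesuffix('-00')

# Sort each (source1, target) pair so reversed pairs compare equal.
df1[['source1', 'target']] = np.sort(df1[['source1', 'target']].values, axis=1)

df2 = df1.drop_duplicates(subset=['source1', 'target']).drop(columns='source1')
print(df2)
```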
Upvotes: 3
Reputation: 7967
Your definition of duplicates is not the one pandas uses. In pandas, two rows are duplicates only if their corresponding entries are equal. In the example below, rows 0 and 1 are not duplicates because their values differ column by column, whereas rows 2 and 3 are duplicates.
df = {'source':['acc1.yyy', 'acx1.xxx', 'acc1.xxx', 'acc1.xxx'], 'target': ['acx1.xxx', 'acc1.yyy', 'acc1.yyy', 'acc1.yyy']}
df = pd.DataFrame(df)
df
# source target
# 0 acc1.yyy acx1.xxx
# 1 acx1.xxx acc1.yyy
# 2 acc1.xxx acc1.yyy
# 3 acc1.xxx acc1.yyy
df.drop_duplicates()
# source target
# 0 acc1.yyy acx1.xxx
# 1 acx1.xxx acc1.yyy
# 2 acc1.xxx acc1.yyy
For the case you are describing, create a new column that holds a sorted tuple of the source and target columns. Try the following:
df['src_tgt'] = [tuple(sorted(pair)) for pair in zip(df['source'], df['target'])]
df
# source target src_tgt
# 0 acc1.yyy acx1.xxx (acc1.yyy, acx1.xxx)
# 1 acx1.xxx acc1.yyy (acx1.xxx, acc1.yyy)
# 2 acc1.xxx acc1.yyy (acc1.xxx, acc1.yyy)
# 3 acc1.xxx acc1.yyy (acc1.xxx, acc1.yyy)
df.drop_duplicates(subset=['src_tgt'])
# source target src_tgt
# 0 acc1.yyy acx1.xxx (acc1.yyy, acx1.xxx)
# 2 acc1.xxx acc1.yyy (acc1.xxx, acc1.yyy)
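A variant of the same idea that avoids keeping a helper column: build an order-insensitive `frozenset` key per row and filter with `duplicated()`. A sketch on the same sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'source': ['acc1.yyy', 'acx1.xxx', 'acc1.xxx', 'acc1.xxx'],
    'target': ['acx1.xxx', 'acc1.yyy', 'acc1.yyy', 'acc1.yyy'],
})

# frozenset ignores element order, so (a, b) and (b, a) get the same key.
key = pd.Series([frozenset(p) for p in zip(df['source'], df['target'])],
                index=df.index)

# Keep only the first row of each pair; df itself is left unchanged.
deduped = df[~key.duplicated()]
print(deduped)
```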
Upvotes: 1