Reputation: 145
I need some help removing duplicated pairs from a pandas DataFrame.
I have:
>>> df1
source target metric
0 acc1.yyy acx1.xxx 10000
1 acx1.xxx acc1.yyy 10000
The goal is to remove duplicates based on source+target or target+source, i.e. treat a reversed pair as the same pair. But I can't get this with drop_duplicates:
>>> df2 = df1.drop_duplicates(subset=['source','target'])
>>> df2
source target metric
0 acc1.yyy acx1.xxx 10000
1 acx1.xxx acc1.yyy 10000
[updated]
Maybe "duplicate" is not the correct word here, so let me explain further:
id source target
0 bng1.xxx.00 bdr2.xxx.00
1 bng1.xxx.00 bdr1.xxx.00
2 bdr3.yyy.00 bdr3.xxx.00
3 bdr3.xxx.00 bdr3.yyy.00
4 bdr2.xxx.00 bng1.xxx.00
5 bdr1.xxx.00 bng1.xxx.00
With the above, I want to remove the entries where one row's source equals another row's target and vice versa:
0 and 4 = same pair
1 and 5 = same pair
2 and 3 = same pair
The end goal is to keep rows 0, 1, 2 (or 4, 5, 3).
Upvotes: 3
Views: 1298
Reputation: 862511
You need to sort both columns first:
df1[['source','target']] = df1[['source','target']].apply(sorted, axis=1, result_type='expand')
print (df1)
source target metric
0 acc1.yyy acx1.xxx 10000
1 acc1.yyy acx1.xxx 10000
df2 = df1.drop_duplicates(subset=['source','target'])
print (df2)
source target metric
0 acc1.yyy acx1.xxx 10000
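If `apply(sorted, axis=1)` is slow on a large frame, the same row-wise sort can be done vectorized with `np.sort`; a minimal sketch using the sample data from the question:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'source': ['acc1.yyy', 'acx1.xxx'],
    'target': ['acx1.xxx', 'acc1.yyy'],
    'metric': [10000, 10000],
})

# Sort each (source, target) pair alphabetically in one vectorized call,
# so a reversed pair becomes identical to the original pair.
df1[['source', 'target']] = np.sort(df1[['source', 'target']].values, axis=1)

df2 = df1.drop_duplicates(subset=['source', 'target'])
print(df2)
```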
EDIT:
It seems the source column needs to be adjusted first - remove the last 3 characters:
df1['source1'] = df1.source.str[:-3]
df1[['source1','target']] = df1[['source1','target']].apply(sorted, axis=1, result_type='expand')
print (df1)
id source target source1
0 0 bng1.xxx.00-00 bng1.xxx.00 bdr2.xxx.00
1 1 bng1.xxx.00-00 bng1.xxx.00 bdr1.xxx.00
2 2 bdr3.yyy.00-00 bdr3.yyy.00 bdr3.xxx.00
3 3 bdr3.xxx.00-00 bdr3.yyy.00 bdr3.xxx.00
4 4 bdr2.xxx.00-00 bng1.xxx.00 bdr2.xxx.00
5 5 bdr1.xxx.00-00 bng1.xxx.00 bdr1.xxx.00
df2 = df1.drop_duplicates(subset=['source1','target'])
df2 = df2.drop('source1', axis=1)
print (df2)
id source target
0 0 bng1.xxx.00-00 bng1.xxx.00
1 1 bng1.xxx.00-00 bng1.xxx.00
2 2 bdr3.yyy.00-00 bdr3.yyy.00
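If the trailing "-00" is always a literal suffix, stripping it by name rather than by position is a bit more robust than `str[:-3]`; a sketch assuming pandas >= 1.4 (for `Series.str.removesuffix`) and a two-row subset of the sample data:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'id': [0, 4],
    'source': ['bng1.xxx.00-00', 'bdr2.xxx.00-00'],
    'target': ['bdr2.xxx.00', 'bng1.xxx.00'],
})

# Strip the literal "-00" suffix instead of slicing a fixed width,
# so names of other lengths are handled too.
df1['source1'] = df1['source'].str.removesuffix('-00')

# Sort each (source1, target) pair so reversed pairs compare equal.
df1[['source1', 'target']] = np.sort(df1[['source1', 'target']].values, axis=1)

df2 = df1.drop_duplicates(subset=['source1', 'target']).drop(columns='source1')
print(df2)
```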
Upvotes: 3
Reputation: 7967
Your definition of duplicates is not the one pandas uses. In pandas, two rows are duplicates only if their corresponding entries are equal. In the example below, rows 0 and 1 are not duplicates because their values differ column by column, whereas rows 2 and 3 are duplicates.
df = {'source':['acc1.yyy', 'acx1.xxx', 'acc1.xxx', 'acc1.xxx'], 'target': ['acx1.xxx', 'acc1.yyy', 'acc1.yyy', 'acc1.yyy']}
df = pd.DataFrame(df)
df
# source target
# 0 acc1.yyy acx1.xxx
# 1 acx1.xxx acc1.yyy
# 2 acc1.xxx acc1.yyy
# 3 acc1.xxx acc1.yyy
df.drop_duplicates()
# source target
# 0 acc1.yyy acx1.xxx
# 1 acx1.xxx acc1.yyy
# 2 acc1.xxx acc1.yyy
For the case you are describing, create a new column that holds a sorted tuple of the source and target columns. Try the following:
df['src_tgt'] = [tuple(sorted(pair)) for pair in zip(df['source'], df['target'])]
df
# source target src_tgt
# 0 acc1.yyy acx1.xxx (acc1.yyy, acx1.xxx)
# 1 acx1.xxx acc1.yyy (acx1.xxx, acc1.yyy)
# 2 acc1.xxx acc1.yyy (acc1.xxx, acc1.yyy)
# 3 acc1.xxx acc1.yyy (acc1.xxx, acc1.yyy)
df.drop_duplicates(subset=['src_tgt'])
# source target src_tgt
# 0 acc1.yyy acx1.xxx (acc1.yyy, acx1.xxx)
# 2 acc1.xxx acc1.yyy (acc1.xxx, acc1.yyy)
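A variant of the same idea that avoids keeping a helper column: build an order-insensitive `frozenset` key per row and filter with `duplicated()`. A sketch on the same sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'source': ['acc1.yyy', 'acx1.xxx', 'acc1.xxx', 'acc1.xxx'],
    'target': ['acx1.xxx', 'acc1.yyy', 'acc1.yyy', 'acc1.yyy'],
})

# frozenset ignores element order, so (a, b) and (b, a) get the same key.
key = pd.Series([frozenset(p) for p in zip(df['source'], df['target'])],
                index=df.index)

# Keep only the first row of each pair; df itself is left unchanged.
deduped = df[~key.duplicated()]
print(deduped)
```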
Upvotes: 1