Cmarv
Cmarv

Reputation: 145

Pandas dataframe unique values

need some help with getting uniqued values from pandas dataframe

i have :

    >>> df1
     source    target metric
0  acc1.yyy  acx1.xxx  10000
1  acx1.xxx  acc1.yyy  10000

the goal is to remove unique values based on source+target or target+source. but i can't get this with drop_duplicates.

>>> df2 = df1.drop_duplicates(subset=['source','target'])
>>> df2
     source    target metric
0  acc1.yyy  acx1.xxx  10000
1  acx1.xxx  acc1.yyy  10000

[updated]

maybe duplicate is not the correct word here so let me explain further

id  source  target
0   bng1.xxx.00 bdr2.xxx.00
1   bng1.xxx.00 bdr1.xxx.00
2   bdr3.yyy.00 bdr3.xxx.00
3   bdr3.xxx.00 bdr3.yyy.00
4   bdr2.xxx.00 bng1.xxx.00
5   bdr1.xxx.00 bng1.xxx.00

with above , i want to remove the entryies that have for exampl source=target and target=source.

0 and 4 = same pair
1 and 5 = same pair
2 and 3 = same pair

end goal will be to keep 0 1 2 or 4 5 3 .

Upvotes: 3

Views: 1298

Answers (2)

jezrael
jezrael

Reputation: 862511

You need first sort both columns:

df1[['source','target']] = df1[['source','target']].apply(sorted,axis=1)
print (df1)
     source    target  metric
0  acc1.yyy  acx1.xxx   10000
1  acc1.yyy  acx1.xxx   10000

df2 = df1.drop_duplicates(subset=['source','target'])
print (df2)
     source    target  metric
0  acc1.yyy  acx1.xxx   10000

EDIT:

It seems column source need be changed - remove last 3 characters:

df1['source1'] = df1.source.str[:-3]
df1[['source1','target']] = df1[['source1','target']].apply(sorted,axis=1)
print (df1)
   id          source       target      source1
0   0  bng1.xxx.00-00  bng1.xxx.00  bdr2.xxx.00
1   1  bng1.xxx.00-00  bng1.xxx.00  bdr1.xxx.00
2   2  bdr3.yyy.00-00  bdr3.yyy.00  bdr3.xxx.00
3   3  bdr3.xxx.00-00  bdr3.yyy.00  bdr3.xxx.00
4   4  bdr2.xxx.00-00  bng1.xxx.00  bdr2.xxx.00
5   5  bdr1.xxx.00-00  bng1.xxx.00  bdr1.xxx.00

df2 = df1.drop_duplicates(subset=['source1','target'])
df2 = df2.drop('source1', axis=1)
print (df2)
   id          source       target
0   0  bng1.xxx.00-00  bng1.xxx.00
1   1  bng1.xxx.00-00  bng1.xxx.00
2   2  bdr3.yyy.00-00  bdr3.yyy.00

Upvotes: 3

Clock Slave
Clock Slave

Reputation: 7967

Your definition of duplicates is not the same as the one that pandas uses. In pandas two rows are considered duplicates if corresponding entries are the same. In the example below row 1 and row 2 are not duplicates because they have different values for corresponding varaibles, whereas row 3 and 4 are duplicates.

df = {'source':['acc1.yyy', 'acx1.xxx', 'acc1.xxx', 'acc1.xxx'], 'target': ['acx1.xxx', 'acc1.yyy', 'acc1.yyy', 'acc1.yyy']}
df = pd.DataFrame(df)
df
     # source    target
# 0  acc1.yyy  acx1.xxx
# 1  acx1.xxx  acc1.yyy
# 2  acc1.xxx  acc1.yyy
# 3  acc1.xxx  acc1.yyy
df.drop_duplicates()
     # source    target
# 0  acc1.yyy  acx1.xxx
# 1  acx1.xxx  acc1.yyy
# 2  acc1.xxx  acc1.yyy

For the case that you are mentioning, create a new column that is a tuple of the the source and target columns. Try the following

df.loc[:, 'src_tgt'] = pd.Series([tuple(sorted(each)) for each in list(zip(df.source.values.tolist(), df.target.values.tolist()))])
df
     # source    target               src_tgt
# 0  acc1.yyy  acx1.xxx  (acc1.yyy, acx1.xxx)
# 1  acx1.xxx  acc1.yyy  (acx1.xxx, acc1.yyy)
# 2  acc1.xxx  acc1.yyy  (acc1.xxx, acc1.yyy)
# 3  acc1.xxx  acc1.yyy  (acc1.xxx, acc1.yyy)
df.drop_duplicates(subset=['src_tgt'])
     # source    target               src_tgt
# 0  acc1.yyy  acx1.xxx  (acc1.yyy, acx1.xxx)
# 2  acc1.xxx  acc1.yyy  (acc1.xxx, acc1.yyy)

Upvotes: 1

Related Questions