jamo
jamo

Reputation: 91

check similarity of some values in data frame based on id column

I have a df like this:

ix  y1  y2  id
ix1 X   X   AP10579
ix2 E   E   AP17998
ix3 C   C   AP283716
ix4 C   C   AP283716
ix5 E   E   AP17998
ix6 T   T   AP21187
ix7 X   Z   AP10579
ix8 T   K   AP21187
ix9 E   E   AP12457
ix10 C  C   Ap87930

in id column, we have two ids which are similar (f.x. ix1 & ix7 have the same id, ix2 & ix5, and so on) . also we have some unique ids,
I want to check if y1+y2 of each of these two ids are the same or not, and if they are the same so move one of them in a new df, also move every unique id,

so I should have a new df, df_new, like this:

ix  y1  y2  id
ix2 E   E   AP17998
ix3 C   C   AP283716
ix9 E   E   AP12457
ix10 C  C   Ap87930

any suggestions is appreciated.

 df = {
    'ix': ['ix1','ix2','ix3','ix4','ix5','ix6','ix7','ix8','ix9','ix10'],
    'y1': ['X','E','C','C','E','T','X','T', 'E','C'],
    'y2': ['X','E','C','C','E','T','Z','K', 'E','C'],
    'id': ['AP10579','AP17998','AP283716','AP283716','AP17998','AP21187','AP10579','AP21187', 'AP12457', 'Ap87930']
}

Upvotes: 1

Views: 42

Answers (1)

Andrea
Andrea

Reputation: 3077

This is a possible approach:

df = pd.DataFrame({
    'ix': ['ix1','ix2','ix3','ix4','ix5','ix6','ix7','ix8','ix9','ix10'],
    'y1': ['X','E','C','C','E','T','X','T', 'E','C'],
    'y2': ['X','E','C','C','E','T','Z','K', 'E','C'],
    'id': ['AP10579','AP17998','AP283716','AP283716','AP17998','AP21187','AP10579','AP21187', 'AP12457', 'Ap87930']
})

def filter_df(g):
    if len(g) == 1:
        return g.iloc[0]
    if g.y1.unique().size + g.y2.unique().size == 2:
        return g.iloc[0]


df.groupby('id').agg(filter_df).dropna().reset_index()

output:

         id    ix y1 y2
0   AP12457   ix9  E  E
1   AP17998   ix2  E  E
2  AP283716   ix3  C  C
3   Ap87930  ix10  C  C

Upvotes: 1

Related Questions