I have a pandas dataframe with inconsistent rows . In the example below key1 and key2 are two values which put together must be unique, so the couple (key1 ,key2) is the primary key and should appear once in dataframe, while info is a binary information of (key1 ,key2) and could be T or F . Unfortunately (key1 ,key2) are repeated in the dataframe and sometimes they have info=T and other times info=F , which is obviously an error. To remove repetitions I'd like to adopt this reasoning: I'd like to count how many times (for the same couple (key1 ,key2) ) info is T and how many times info is F and if the frequencies are different (most of the time) keep only one of the rows that have the most frequent value between T and F with a function like df.drop_duplicates(subset = ["key1","key2"] , keep = "first") in which first should be the row with most frequent value of info . If instead 50% of rows has info=T and 50% has info=F , I want to remove all of them , because I have no idea which is the right one with a function like df.drop_duplicates(subset = ["key1","key2"] , keep = False) . I don't know how to do this kind of filter because I want to keep 1 row if one case and 0 rows in the other, depending on the values of a specific column within groups of similar rows. Desired behaviour In: key1 key2 info 0 a1 a2 T 1 a1 a2 T #duplicated row of index 0 2 a1 a2 F #similar row of indexes 0 and 1 but inconsistent with info field 3 b1 b2 T 4 b1 b2 T #duplicated row of index 3 5 b1 b3 T #not duplicated since key2 is different from indexes 3 and 4 6 c1 c2 T 7 c1 c2 F #duplicated row of index 5 but inconsistent with info field Out: key1 key2 info 0 a1 a2 T # for(a1,a2) T:2 and F:1 3 b1 b2 T # for(b1,b2) T:2 and F:0 5 b1 b3 T # for(b1,b3) T:1 and F:0 # no rows for (c1,c2) because T:1 and F:1 Thank you

Reputation: 699

Remove rows with the least frequent value in a column grouping by other columns in a Pandas Dataframe

I have a pandas dataframe with inconsistent rows. In the example below key1 and key2 are two values which put together must be unique, so the couple (key1 ,key2) is the primary key and should appear once in dataframe, while info is a binary information of (key1 ,key2) and could be T or F. Unfortunately (key1 ,key2) are repeated in the dataframe and sometimes they have info=T and other times info=F, which is obviously an error.

To remove repetitions I'd like to adopt this reasoning: I'd like to count how many times (for the same couple (key1 ,key2)) info is T and how many times info is F and

if the frequencies are different (most of the time) keep only one of the rows that have the most frequent value between T and F with a function like df.drop_duplicates(subset = ["key1","key2"] , keep = "first") in which first should be the row with most frequent value of info.
If instead 50% of rows has info=T and 50% has info=F, I want to remove all of them, because I have no idea which is the right one with a function like df.drop_duplicates(subset = ["key1","key2"] , keep = False).

I don't know how to do this kind of filter because I want to keep 1 row if one case and 0 rows in the other, depending on the values of a specific column within groups of similar rows.

Desired behaviour

In:

     key1  key2    info
0    a1    a2      T 
1    a1    a2      T #duplicated row of index 0
2    a1    a2      F #similar row of indexes 0 and 1 but inconsistent with info field
3    b1    b2      T 
4    b1    b2      T #duplicated row of index 3
5    b1    b3      T #not duplicated since key2 is different from indexes 3 and 4
6    c1    c2      T 
7    c1    c2      F #duplicated row of index 5 but inconsistent with info field

Out:

     key1  key2     info
0    a1    a2       T # for(a1,a2) T:2 and F:1
3    b1    b2       T # for(b1,b2) T:2 and F:0
5    b1    b3       T # for(b1,b3) T:1 and F:0
                    # no rows for (c1,c2) because T:1 and F:1

Thank you

Upvotes: 4

Answers (4)

Active_Learner

Reputation: 173

this is just my take on it.

df = pd.DataFrame(data=[["a1", "a2", "T"],
                          ["a1", "a2", "T"],
                          ["a1", "a2", "F"],
                          ["b1", "b2", "T"],
                          ["b1", "b2", "T"],
                          ["b1", "b3", "T"],
                          ["c1", "c2", "T"],
                          ["c1", "c2", "F"],], columns =["key1", "key2", "info"])
df = df.groupby(["key1", "key2", "info"]).size().reset_index()
df = df.drop_duplicates(subset=["key1", "key2", 0], keep=False)
df = df.groupby(["key1", "key2"]).max().reset_index()
df = df.drop(0, axis=1)

Upvotes: 0

Space Impact

Reputation: 13255

Using groupby, idxmin and idxmin:

df_ = df.groupby(["key1","key2"]).info.value_counts().unstack(level=2, fill_value=0)
df_max = df_.idxmax(axis=1)
df = df_max.loc[df_max!=df_.idxmin(axis=1)].reset_index(name='info')

print(df)
  key1 key2 info
0   a1   a2    T
1   b1   b2    T
2   b1   b3    T

Upvotes: 0

David Erickson

Reputation: 16683

Another solution is to create two temporary column that calculate the count and max of the groups. Then, filter out rows where the group's count does not equal the max (i.e. more than 50% if you only have T and F values) and then drop_duplicates(). The final piece of logic is to filter out [key1 , key2] values where 50% are T and 50% are F. For that, use drop_duplicates again, but on a different subset that includes count, because if the count is the same that means you don't know which one to select as you mentioned in your question. Lastly, drop the temporary count column.

df['count'] = df.groupby(['key1', 'key2', 'info'])['info'].transform('count')
df['max'] = df.groupby(['key1', 'key2'])['count'].transform('max')
df = (df.loc[(df['count'] == df['max']), ['key1', 'key2', 'info','count']]
        .drop_duplicates(subset=['key1', 'key2','info'])
        .drop_duplicates(subset=['key1', 'key2', 'count'], keep=False)
        .drop('count', axis=1))

output:

    key1    key2    info
0   a1      a2      T
3   b1      b2      T
5   b1      b3      T

Upvotes: 0

ALollz

Reputation: 59579

groupby and use pd.Series.mode to get the modal value. pd.Series.mode will return the modes in the case of ties, so this allows us to remove these cases with drop_duplicates as we expect only a single mode for each unique ['key1', 'key2'].

import pandas as pd

(df.groupby(['key1', 'key2'])['info']
   .apply(pd.Series.mode)
   .reset_index()
   .drop_duplicates(['key1', 'key2'], keep=False)
   .drop(columns='level_2')
)

#  key1 key2 info
#0   a1   a2    T
#1   b1   b2    T
#2   b1   b3    T

The result of the groupby + mode is:

key1  key2   
a1    a2    0    T
b1    b2    0    T
      b3    0    T
c1    c2    0    F   # Tied mode so it gets 2 rows with the last
            1    T   # index level indicating the # of items tied for mode.

Upvotes: 2

Remove rows with the least frequent value in a column grouping by other columns in a Pandas Dataframe

Answers (4)

Related Questions