Reputation: 699
I have a pandas dataframe with inconsistent rows. In the example below key1
and key2
are two values which put together must be unique, so the couple (key1 ,key2)
is the primary key and should appear once in dataframe, while info
is a binary information of (key1 ,key2)
and could be T
or F
. Unfortunately (key1 ,key2)
are repeated in the dataframe and sometimes they have info=T
and other times info=F
, which is obviously an error.
To remove repetitions I'd like to adopt this reasoning: I'd like to count how many times (for the same couple (key1 ,key2)
) info
is T
and how many times info
is F
and
T
and F
with a function like df.drop_duplicates(subset = ["key1","key2"] , keep = "first")
in which first
should be the
row with most frequent value of info
.info=T
and 50% has info=F
, I want to remove all of
them, because I have no idea which is the right one with a function
like df.drop_duplicates(subset = ["key1","key2"] , keep = False)
.I don't know how to do this kind of filter because I want to keep 1 row if one case and 0 rows in the other, depending on the values of a specific column within groups of similar rows.
Desired behaviour
In:
key1 key2 info
0 a1 a2 T
1 a1 a2 T #duplicated row of index 0
2 a1 a2 F #similar row of indexes 0 and 1 but inconsistent with info field
3 b1 b2 T
4 b1 b2 T #duplicated row of index 3
5 b1 b3 T #not duplicated since key2 is different from indexes 3 and 4
6 c1 c2 T
7 c1 c2 F #duplicated row of index 5 but inconsistent with info field
Out:
key1 key2 info
0 a1 a2 T # for(a1,a2) T:2 and F:1
3 b1 b2 T # for(b1,b2) T:2 and F:0
5 b1 b3 T # for(b1,b3) T:1 and F:0
# no rows for (c1,c2) because T:1 and F:1
Thank you
Upvotes: 4
Views: 1282
Reputation: 173
this is just my take on it.
df = pd.DataFrame(data=[["a1", "a2", "T"],
["a1", "a2", "T"],
["a1", "a2", "F"],
["b1", "b2", "T"],
["b1", "b2", "T"],
["b1", "b3", "T"],
["c1", "c2", "T"],
["c1", "c2", "F"],], columns =["key1", "key2", "info"])
df = df.groupby(["key1", "key2", "info"]).size().reset_index()
df = df.drop_duplicates(subset=["key1", "key2", 0], keep=False)
df = df.groupby(["key1", "key2"]).max().reset_index()
df = df.drop(0, axis=1)
Upvotes: 0
Reputation: 13255
Using groupby
, idxmin
and idxmin
:
df_ = df.groupby(["key1","key2"]).info.value_counts().unstack(level=2, fill_value=0)
df_max = df_.idxmax(axis=1)
df = df_max.loc[df_max!=df_.idxmin(axis=1)].reset_index(name='info')
print(df)
key1 key2 info
0 a1 a2 T
1 b1 b2 T
2 b1 b3 T
Upvotes: 0
Reputation: 16683
Another solution is to create two temporary column that calculate the count
and max
of the groups. Then, filter out rows where the group's count
does not equal the max
(i.e. more than 50% if you only have T
and F
values) and then drop_duplicates()
. The final piece of logic is to filter out [key1 , key2]
values where 50% are T
and 50% are F
. For that, use drop_duplicates again, but on a different subset that includes count
, because if the count
is the same that means you don't know which one to select as you mentioned in your question. Lastly, drop the temporary count
column.
df['count'] = df.groupby(['key1', 'key2', 'info'])['info'].transform('count')
df['max'] = df.groupby(['key1', 'key2'])['count'].transform('max')
df = (df.loc[(df['count'] == df['max']), ['key1', 'key2', 'info','count']]
.drop_duplicates(subset=['key1', 'key2','info'])
.drop_duplicates(subset=['key1', 'key2', 'count'], keep=False)
.drop('count', axis=1))
output:
key1 key2 info
0 a1 a2 T
3 b1 b2 T
5 b1 b3 T
Upvotes: 0
Reputation: 59579
groupby
and use pd.Series.mode
to get the modal value. pd.Series.mode
will return the modes in the case of ties, so this allows us to remove these cases with drop_duplicates
as we expect only a single mode for each unique ['key1', 'key2']
.
import pandas as pd
(df.groupby(['key1', 'key2'])['info']
.apply(pd.Series.mode)
.reset_index()
.drop_duplicates(['key1', 'key2'], keep=False)
.drop(columns='level_2')
)
# key1 key2 info
#0 a1 a2 T
#1 b1 b2 T
#2 b1 b3 T
The result of the groupby
+ mode
is:
key1 key2
a1 a2 0 T
b1 b2 0 T
b3 0 T
c1 c2 0 F # Tied mode so it gets 2 rows with the last
1 T # index level indicating the # of items tied for mode.
Upvotes: 2