Reputation: 175

filtering nearly duplicate data based on a column condition

I have a pandas dataframe, here is an example:

names  subject     date       marks
A        X      15-05-2021     10
A        X      22-05-2021     12
A        X      29-05-2021     11
A        Y      15-05-2021     15
A        Y      22-05-2021     10
B        P      22-05-2021     17
B        P      29-05-2021     16
B        Q      29-05-2021     14
B        R      22-05-2021     16

I need to filter out all names, dates and subjects for which data is available for "29-05-2021".

this is what I need:

names  subject     date       marks
A        X      15-05-2021     10
A        X      22-05-2021     12
A        X      29-05-2021     11
B        P      22-05-2021     17
B        P      29-05-2021     16
B        Q      29-05-2021     14

I have figured out how to do this, but I think it can be improved.

this is what I am doing as of now:

df = pd.read_excel("",name="sheet")
end_date = "29-05-2021"
end_date = pd.to_datetime(end_date)
mask = df["date"] == end_date
temp_df = df.loc[mask]
temp_df = temp_df[["names","subject"]]
result_df = pd.merge(df,temp_df, on = ["names","subject"],how="right"]

is there any scope of improvement?

Upvotes: 0

Answers (2)

jezrael

Reputation: 862911

I think for performance is your solution good, a bit simplified with remove on (because joined by intersection of columns names, here name, subject) and need inner join, which is default, so also should be omitted:

end_date = "29-05-2021"

mask = df["date"] == pd.to_datetime(end_date)
result_df = pd.merge(df,df.loc[mask, ["names","subject"]])
print (result_df)
  names subject       date  marks
0     A       X 2021-05-15     10
1     A       X 2021-05-22     12
2     A       X 2021-05-29     11
3     B       P 2021-05-22     17
4     B       P 2021-05-29     16
5     B       Q 2021-05-29     14

Alternative solution with GroupBy.transform is a bit slowier if large DataFrames:

end_date = "29-05-2021"

mask = df["date"] == pd.to_datetime(end_date)
df = df[df.assign(m = mask).groupby(['names','subject'])['m'].transform('any')]

Upvotes: 1

wwnde

Reputation: 26676

Just groupby, and filter any date that has your specs.

df.groupby(['names','subject']).filter(lambda x:(x['date']=='29-05-2021').any())

Upvotes: 1

filtering nearly duplicate data based on a column condition

Answers (2)

Related Questions