remove rows of dataframe based on difference with other dataframes

Question

I have 1000 observations that I have analysed using 4 different models. The results of these models are stored in 4 data frames, that look something like this:

import pandas as pd

model_1 = pd.DataFrame(data={"observation_id": [0, 1, ..., 1000], 
                             "param_1": [0, 4, ..., 2]})

model_2 = pd.DataFrame(data={"observation_id": [0, 2, ..., 1000], 
                             "param_1": [1, 3, ..., 1]})

similarly for the other two (there are actually many more parameters in the data frames).

The "observation_id" uniquely matches the results of a model to the observation. So an observation_id of 0 gives the results of the four models for that particular observation.

The problem is that there are gaps in the modelled data. model_1 might be missing observation_id 45 and model_3 might be missing observation_id 721.

How can I create four new data frames containing just the results related to observation_ids that are common to all 4 datasets?

BENY · Accepted Answer

Here is one way using merge with how='left'

from functools import reduce
p1 = reduce(lambda left,right: pd.merge(left,right,on='observation_id',how='left'), [df1,df2,df3,df4])

If you only want one df not four

from functools import reduce
df = reduce(lambda left,right: pd.merge(left,right,on='observation_id',how='inner'), [df1,df2,df3,df4])

remove rows of dataframe based on difference with other dataframes

Answers (1)

Related Questions