Kashyap
Kashyap

Reputation: 83

Merge dataframes while removing duplicates

I cant get my head around Merge and joins.

I have an initial data frame, that looks like below:

 Car    Units Month
 BMW     800   Jan
 Lexus   150   Jan
 Audi    210   Feb
 Mustang 90    Mar
 Bentley 110   Feb
 Jaguar  90    Jan

I will successively get more dataframes to append to this initial df.

And I will have to append only those values which are unique in 'Car' column and also 'Month' column.

Let's say I got another dataframe to append:

Car   Units Month
BMW     130   Jan
Tata    150   Jan
Maruti  110   Mar
Audi    250   Mar

Desired output: When I append 2nd dataframe to the first one, 'Tata', 'Maruti', 'Audi' should get appended. 'BMW' should not get appended.

The code I am using is:

import pandas as pd


df1 = pd.DataFrame(
   {
      "Car": ['BMW', 'Lexus', 'Audi', 'Mustang', 'Bentley', 'Jaguar'],"Units": [800, 150, 210, 90, 110, 90], "Month": ['Jan','Jan', 'Feb', 'Mar', 'Feb', 'Jan']
   }
)

df2 = pd.DataFrame(
   {
      "Car": ['BMW', 'Tata', 'Maruti', 'Audi'],"Units": [130, 150, 110, 250], "Month": ['Jan','Jan','Mar', 'Mar']
   }
)


df1 = pd.merge(df1, df2, on=['Car', 'Month'], how="left", indicator=False)

But somehow it doesn't work.

Upvotes: 0

Views: 76

Answers (1)

Mabel Villalba
Mabel Villalba

Reputation: 2598

Using "left" will only affect to using just the keys on the left for ['Car', 'Month'].

One solution is to use outer to have all the combinations of ['Car', 'Month'] keys. I set the indicator to true the origin of the value and also the left to have no suffix so we can take advantage of this later:

joined=pd.merge(df1, df2, on=['Car', 'Month'],
                how="outer", indicator=True, suffixes=["", "_y"])
>> joined

       Car  Units Month  Units_y      _merge
0      BMW  800.0   Jan    130.0        both
1    Lexus  150.0   Jan      NaN   left_only
2     Audi  210.0   Feb      NaN   left_only
3  Mustang   90.0   Mar      NaN   left_only
4  Bentley  110.0   Feb      NaN   left_only
5   Jaguar   90.0   Jan      NaN   left_only
6     Tata    NaN   Jan    150.0  right_only
7   Maruti    NaN   Mar    110.0  right_only
8     Audi    NaN   Mar    250.0  right_only

Selecting the units columns:

units = joined.filter(regex="^Units", axis=1)

>> units

   Units  Units_y
0  800.0    130.0
1  150.0      NaN
2  210.0      NaN
3   90.0      NaN
4  110.0      NaN
5   90.0      NaN
6    NaN    150.0
7    NaN    110.0
8    NaN    250.0

Then, since you want the results on the left to take precedence, you could fill the Nan in the horizontal direction, so if there are any values on Units (the value coming from the left), they will be preserved:

 filled_units = units.fillna(axis=1, method='bfill')

 >> filled_units

   Units  Units_y
0  800.0    130.0
1  150.0      NaN
2  210.0      NaN
3   90.0      NaN
4  110.0      NaN
5   90.0      NaN
6  150.0    150.0
7  110.0    110.0
8  250.0    250.0

Finally, we can slice filled_units and combine it with the values for Car and Month in joined:

pd.concat([joined[['Car', 'Month']], filled_units["Units"]], axis=1)

       Car Month  Units
0      BMW   Jan  800.0
1    Lexus   Jan  150.0
2     Audi   Feb  210.0
3  Mustang   Mar   90.0
4  Bentley   Feb  110.0
5   Jaguar   Jan   90.0
6     Tata   Jan  150.0
7   Maruti   Mar  110.0
8     Audi   Mar  250.0

Hope it helps

Upvotes: 1

Related Questions