kutlus
kutlus

Reputation: 361

pandas merged data length

I have two data frames, each has one column with the same values (and equal length) but different order as in simplified example;

df1=pd.DataFrame(['a','b','c','d','e'],columns=['names'])
df2=pd.DataFrame(['b','e','a','c','d'],columns=['names'])

I want to know the corresponding index of each row in df1 in df2 and do;

df= pd.merge(df1.reset_index(), df2.reset_index(), on=['names'])

this works and as expected for this example,the length of the data frames are equal len(df1)=len(df2)=len(df)

However in my real data, len(df1)=len(df2)=1714 and len(df)=1676

I am puzzled, how is this possible?

I just did an experiment and added duplicates.

df1=pd.DataFrame(['e','a','b','c','d','e'],columns=['names'])
df2=pd.DataFrame(['b','e','a','e','c','d'],columns=['names'])
df= pd.merge(df1.reset_index(), df2.reset_index(), on=['names'])

This gives len(df)=8 larger than len(df1)=len(df2)=6.

But in my real data df is smaller than individual df lengths.

Upvotes: 1

Views: 86

Answers (1)

BENY
BENY

Reputation: 323326

Since pandas merge default is inner join , when you not specific the method of how , it will only output the row both in two dfs

For example :

df1=pd.DataFrame(['a'],columns=['names'])
df2=pd.DataFrame(['b','e','a','c','d'],columns=['names'])
pd.merge(df1.reset_index(), df2.reset_index(), on=['names'])
   index_x names  index_y
0        0     a        2

Update

df1=pd.DataFrame(['a','a'],columns=['names'])
df2=pd.DataFrame(['b','e','a','a','c','d'],columns=['names'])

df1.merge(df2)
  names
0     a
1     a
2     a
3     a

Upvotes: 2

Related Questions