Lusian

Reputation: 653

Inner join in pandas results in a Cartesian product

This is a very general question. Is it possible for an inner join in pandas to produce a merged dataset with more observations than the larger of the two input datasets? In other words, if I have one dataset with 30181537 observations and another with 23483111 observations, how can the result have 112039626 observations after an inner merge on a variable v1? Variable v1 contains duplicates in both datasets.

Thanks

Upvotes: 0

Views: 54

Answers (1)

cmauck10

Reputation: 163

Because v1 contains duplicate values in both datasets, each value of v1 produces i * j rows in the merged result, where i is the number of rows with that value in dataframe A and j is the number of rows with that value in dataframe B. Summed over all values that appear in both dataframes, the total can easily exceed the size of either input.
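
A small sketch of the i * j behaviour may help. The toy dataframes and the column names a and b are made up for illustration; only v1 comes from the question. The second half estimates the merged row count from the per-key counts without performing the merge, which can be handy on datasets as large as the ones described.

```python
import pandas as pd

# Hypothetical toy data: key 1 appears 3 times in A and 2 times in B.
df_A = pd.DataFrame({"v1": [1, 1, 1, 2], "a": [10, 11, 12, 13]})
df_B = pd.DataFrame({"v1": [1, 1, 2, 3], "b": [20, 21, 22, 23]})

merged = df_A.merge(df_B, on="v1", how="inner")

# Predict the row count: multiply the per-key counts and sum over shared keys.
counts_A = df_A["v1"].value_counts()
counts_B = df_B["v1"].value_counts()
# Keys present in only one dataframe become NaN after alignment and are dropped.
expected_rows = int(counts_A.mul(counts_B).dropna().sum())

print(len(merged), expected_rows)
```

Here key 1 contributes 3 * 2 rows and key 2 contributes 1 * 1, so the merged result is larger than either input even though this is an inner join.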

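If you don't want this, drop the duplicate keys before merging. Note that this keeps only the first row for each v1 value and discards the rest: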

df_A = df_A.drop_duplicates(subset=['v1'])
df_B = df_B.drop_duplicates(subset=['v1'])
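
As a complement, and not part of the original answer, pandas can also check key uniqueness for you through the validate argument of merge. The sketch below uses made-up toy data (columns a and b are hypothetical) and assumes a one-to-one relationship is what you expect.

```python
import pandas as pd

# Hypothetical toy data with duplicate keys on both sides.
df_A = pd.DataFrame({"v1": [1, 1, 2], "a": [10, 11, 12]})
df_B = pd.DataFrame({"v1": [1, 2, 2], "b": [20, 21, 22]})

# validate="one_to_one" raises MergeError if v1 is not unique in both frames.
try:
    df_A.merge(df_B, on="v1", how="inner", validate="one_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True

# After dropping duplicate keys, the same validated merge succeeds.
dedup_A = df_A.drop_duplicates(subset=["v1"])
dedup_B = df_B.drop_duplicates(subset=["v1"])
merged = dedup_A.merge(dedup_B, on="v1", how="inner", validate="one_to_one")
```

Whether dropping rows is acceptable depends on what the duplicates mean in your data; if they carry distinct information, aggregating per v1 before merging may be a better fit than drop_duplicates.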

Upvotes: 1
