Reputation: 79
PySpark: merging two dataframes with a left join results in a very large number of rows. Why are there so many rows after the merge? Is there anything seriously wrong with my code? Both dataframes have one common key, 'Region'.
merged_df = df1.join(df2, on=['Region'] , how = 'left')
I was expecting more rows than in df1, but not billions.
Upvotes: 0
Views: 96
Reputation: 8291
Let's assume two dataframes:
The left join result is:
In other words, a LEFT JOIN returns every record from the LEFT (first) dataframe, regardless of whether it has a match in the RIGHT dataframe. If the right dataframe has no matching row, the right-side columns in the result are null.
For every row in the first dataframe, it returns one output row per matching row in the second dataframe, so a Region value that is repeated on both sides multiplies the row count.
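To make the multiplication concrete, here is a minimal plain-Python sketch of left-join semantics on a single key. The helper `left_join`, the sample rows, and the column names `sales` and `manager` are hypothetical and only illustrate the behaviour; this is not how Spark implements the join.

```python
from collections import defaultdict


def left_join(left_rows, right_rows, key):
    """Model a left join of two lists of dicts on one key column."""
    # Group the right-hand rows by key value.
    right_by_key = defaultdict(list)
    for row in right_rows:
        right_by_key[row[key]].append(row)

    # Columns that exist only on the right side (set to None when unmatched).
    right_only = {c for row in right_rows for c in row if c != key}

    joined = []
    for left in left_rows:
        matches = right_by_key.get(left[key])
        if matches:
            # One output row per matching right-hand row.
            for right in matches:
                joined.append({**right, **left})
        else:
            # No match: keep the left row, fill right columns with None.
            joined.append({**{c: None for c in right_only}, **left})
    return joined


# Hypothetical sample data: 'East' appears twice on the left, three times on the right.
df1_rows = [
    {"Region": "East", "sales": 10},
    {"Region": "East", "sales": 20},
    {"Region": "West", "sales": 30},
]
df2_rows = [
    {"Region": "East", "manager": "A"},
    {"Region": "East", "manager": "B"},
    {"Region": "East", "manager": "C"},
]

result = left_join(df1_rows, df2_rows, "Region")
print(len(result))  # 2 * 3 matched 'East' rows + 1 unmatched 'West' row
```

Three left rows become seven output rows here because 'East' is duplicated on both sides, while the unmatched 'West' row is kept once with a null `manager`.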
As kasyap said, the maximum possible number of rows is
47,972 x 852,747 = 40,907,979,084
which occurs if the Region column holds the same single value in every row of both dataframes.
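The size of a left join can be predicted from the per-key counts before running it. The sketch below is a plain-Python illustration; the function `left_join_row_count` is a hypothetical helper, and treating 47,972 as the left-hand row count is an assumption, since the post does not say which dataframe has which size.

```python
from collections import Counter


def left_join_row_count(left_keys, right_keys):
    """Predict the number of rows a left join on these key values produces."""
    right_counts = Counter(right_keys)
    # Each left row yields one row per right match, or one row if unmatched.
    return sum(max(1, right_counts[k]) for k in left_keys)


# Worst case: every row on both sides has the same Region value.
# Assumption: the left dataframe is the one with 47,972 rows.
worst_case = left_join_row_count(["X"] * 47_972, ["X"] * 852_747)
print(worst_case)

# If Region is unique on the right side, the join adds no rows.
no_growth = left_join_row_count(["a", "b", "a"], ["a", "b"])
print(no_growth)
```

In PySpark, the per-key counts can be inspected with something like `df2.groupBy('Region').count()` before joining. If Region is heavily duplicated on both sides, the join likely needs additional key columns, or one side should be deduplicated or aggregated first.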
Upvotes: 1