moski

Reputation: 79

Spark dataframe: Merged data with python results in a very large number of rows

PySpark: merging two dataframes with a left join results in a very large number of rows. Why are there so many rows after the merge? Is there anything seriously wrong with my code? Both dataframes have one common key, 'Region'.

merged_df = df1.join(df2, on=['Region'] , how = 'left')

I expected some additional rows, but the result has billions.

Upvotes: 0

Views: 96

Answers (1)

Pratik Lad

Reputation: 8291

Let's assume two dataframes:

[image: two example dataframes]

The left join result is:

[image: result of the left join]

In other words, a LEFT JOIN returns every record from the LEFT (first) dataframe, regardless of whether it has a match in the RIGHT dataframe. If the right dataframe has no match for a row, the right-hand columns are null for that row.

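Each row in the first dataframe is paired with every row in the second dataframe that has the same Region. If a Region value is repeated in both dataframes, the number of rows for that Region is the product of the two counts, so the result can be far larger than either input.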
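To make the multiplication concrete, here is a minimal sketch in plain Python (no Spark needed) that counts how many rows a left join would produce from the key values alone. The function name `left_join_row_count` and the sample regions are made up for illustration.

```python
from collections import Counter

def left_join_row_count(left_keys, right_keys):
    """Return the number of rows a left join on these key values would produce."""
    right_counts = Counter(right_keys)
    # Each left row is paired with every matching right row.
    # A left row with no match still yields one row (nulls on the right side).
    return sum(max(right_counts[k], 1) for k in left_keys)

# Hypothetical sample data: "East" appears twice on the left and three times on the right.
left = ["East", "East", "West", "North"]
right = ["East", "East", "East", "West"]

# East: 2 * 3 = 6, West: 1 * 1 = 1, North: no match = 1
print(left_join_row_count(left, right))  # 8
```

Four rows joined with four rows already give eight rows here; the same effect at scale is what produces billions.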

As kasyap said, the maximum possible number of rows is 47,972 x 852,747 = 40,907,979,084, which happens if every row in both dataframes has the same Region value.

Upvotes: 1
