Foxbat

Reputation: 352

Join different DataFrames using loop in Pyspark

I have 5 CSV files in a folder, and I want to join them into one DataFrame in PySpark. I use the code below:

name_file =['A', 'B', 'C', 'D', 'V']
for n in name_file:
n= spark.read.csv(fullpath+n+'.csv'
                ,header=False, 
                inferSchema= True)
full_data=full_data.join(n,["id"])

Error: I got an unexpected result: the last DataFrame is just joined with itself.

Expected Result: There should be 6 columns. Each CSV has 2 columns, one of which ("id") is common to all of them, and the join should be on that column. As a result, the final DataFrame should have the common column plus 5 distinct columns, one from each CSV file.

Upvotes: 1

Views: 2393

Answers (1)

Anupam Chand

Reputation: 2722

There seem to be several things wrong with the code, or perhaps you have not provided the complete code.

  1. Have you defined fullpath?
  2. You have set header=False, so how will Spark know that there is an "id" column? (One way to handle headerless files is sketched right after this list.)
  3. Your indentation looks wrong under the for loop.
  4. full_data has not been defined yet, so how are you using it on the right side of the assignment within the for loop? I suspect you have initialized it to the first CSV file and are then joining it with the first CSV again.
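
Regarding point 2: if the files genuinely have no header row, one option is to pass an explicit schema so that the "id" join key exists by name. This is only a sketch with a hypothetical second column name, not code from the original post:

from pyspark.sql.types import StructType, StructField, IntegerType

# Name the columns explicitly since the file has no header row.
# "value_a" is a hypothetical column name used for illustration.
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("value_a", IntegerType(), True),
])
df_a = spark.read.csv(fullpath + 'A.csv', header=False, schema=schema)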

I ran a small test with the code below, which worked for me and addresses the questions I've raised above. You can adjust it to your needs.

fullpath = '/content/sample_data/'

# Initialize full_data with the first CSV so there is something to join onto.
full_data = spark.read.csv(fullpath + 'Book1.csv', header=True, inferSchema=True)

# Join each remaining CSV onto full_data using the common "id" column.
name_file = ['Book2', 'Book3']
for n in name_file:
    df = spark.read.csv(fullpath + n + '.csv', header=True, inferSchema=True)
    full_data = full_data.join(df, ["id"])

full_data.show(5)
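
As a side note, the same fold can be written without seeding full_data by hand, using functools.reduce. This is a sketch under the same assumptions as above (the same Book CSV files and a shared "id" column):

from functools import reduce

fullpath = '/content/sample_data/'
names = ['Book1', 'Book2', 'Book3']

# Read every CSV into a list of DataFrames, then fold them together
# with successive inner joins on the shared "id" column.
dfs = [spark.read.csv(fullpath + n + '.csv', header=True, inferSchema=True)
       for n in names]
full_data = reduce(lambda left, right: left.join(right, ["id"]), dfs)
full_data.show(5)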

Upvotes: 3
