Reputation: 352
I have 5 CSV files in a folder and want to join them into one data frame in PySpark. I use the code below:
name_file = ['A', 'B', 'C', 'D', 'V']
for n in name_file:
    n = spark.read.csv(fullpath + n + '.csv',
                       header=False,
                       inferSchema=True)
    full_data = full_data.join(n, ["id"])
Problem: I get an unexpected result. The last data frame is joined only with itself.
Expected result: There should be 6 columns. Each CSV has 2 columns, one of which is common to all the files, and the join should be on this column. The final data frame should therefore have the common column plus 5 file-specific columns, one from each CSV file.
Upvotes: 1
Views: 2393
Reputation: 2722
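There seem to be several things wrong with the code, or perhaps you have not provided the complete code. For example, full_data is never defined before the loop, and with header=False Spark names the columns _c0, _c1, and so on, so there is no id column to join on.
I ran a small test with the code below, which worked for me and addresses these points. You can adjust it to your needs.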
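# Assumes an existing SparkSession named spark
fullpath = '/content/sample_data/'
# Start from the first file so full_data exists before the loop
full_data = spark.read.csv(fullpath + 'Book1.csv',
                           header=True,  # first row holds the column names, including id
                           inferSchema=True)
name_file = ['Book2', 'Book3']
for name in name_file:
    # Read the next file into its own variable instead of overwriting the loop variable
    df = spark.read.csv(fullpath + name + '.csv',
                        header=True,
                        inferSchema=True)
    # Inner join on the shared id column
    full_data = full_data.join(df, ["id"])
full_data.show(5)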
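The same loop can also be written as a fold over a list of data frames. This is only a minimal sketch that I have not run against your files: join_on_key is a hypothetical helper name, and it assumes every data frame has the key column (id here) and that the remaining column names do not clash.

```python
from functools import reduce


def join_on_key(frames, key="id", how="inner"):
    """Join a non-empty list of DataFrames pairwise on a shared key column.

    Hypothetical helper: it only relies on each object having a
    PySpark-style .join(other, on=..., how=...) method.
    """
    if not frames:
        raise ValueError("need at least one DataFrame to join")
    # reduce folds left to right: ((f0 join f1) join f2) join ...
    return reduce(
        lambda left, right: left.join(right, on=[key], how=how),
        frames,
    )
```

With the names from the question it could be called as join_on_key([spark.read.csv(fullpath + n + '.csv', header=True, inferSchema=True) for n in name_file]), assuming each file has a header row that contains id.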
Upvotes: 3