Reputation: 162
In Amazon S3 I have a folder with around 30 subfolders, and each subfolder contains one CSV file.
I want a simple way to read each CSV file from all the subfolders. Currently I can do this by specifying the path n times, but I feel there must be a more concise way.
e.g. dataframe = sqlContext.read.csv([path1, path2, path3, etc.], header=True)
Upvotes: 1
Views: 2625
Reputation: 2767
Just use a wildcard (*) in the path, assuming each CSV has the same number of columns.
Emulating your situation like this (using Jupyter magic commands so you can see the folder structure):
! ls sub_csv/
print("="*10)
! ls sub_csv/csv1/
! ls sub_csv/csv2/
! ls sub_csv/csv3/
print("="*10)
! cat sub_csv/csv1/*.csv
! cat sub_csv/csv2/*.csv
! cat sub_csv/csv3/*.csv
csv1
csv2
csv3
==========
csv1.csv
csv2.csv
csv3.csv
==========
id
1
id
2
id
3
spark\
.read\
.option("header", "true")\
.csv("sub_csv/*")\
.show()
+---+
| id|
+---+
| 1|
| 2|
| 3|
+---+
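Applied back to your S3 layout, the same glob should work. A minimal sketch, not tested against a real bucket: my-bucket and parent-folder are hypothetical placeholders, and it assumes the s3a connector (hadoop-aws plus credentials) is already configured on your cluster.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-subfolder-csvs").getOrCreate()

# The * expands to every subfolder under parent-folder, so all ~30 CSVs are
# read into a single DataFrame, provided they share the same columns.
df = (
    spark.read
         .option("header", "true")
         .csv("s3a://my-bucket/parent-folder/*")   # hypothetical bucket/prefix
)

# A more explicit pattern that matches the files themselves would be:
# spark.read.option("header", "true").csv("s3a://my-bucket/parent-folder/*/*.csv")

df.show()
Either pattern avoids building the list of 30 paths by hand; Spark expands the glob when it lists the input files.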
Upvotes: 1