Reputation: 547
I want to recursively read all csv files in a given folder into a Spark SQL DataFrame
using a single path, if possible.
My folder structure looks something like this and I want to include all of the files with one path:
resources/first.csv
resources/subfolder/second.csv
resources/subfolder/third.csv
This is my code:
def read: DataFrame =
  sparkSession
    .read
    .option("header", "true")
    .option("inferSchema", "true")
    .option("charset", "UTF-8")
    .csv(path)
Setting path to .../resources/*/*.csv omits first.csv, while .../resources/*.csv omits second.csv and third.csv.
I know csv() also takes multiple strings as path arguments, but I want to avoid that if possible.
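For reference, the multi-path call I would like to avoid looks roughly like this (just a sketch; the two globs together cover all three files):

sparkSession
  .read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("charset", "UTF-8")
  .csv("resources/*.csv", "resources/*/*.csv")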
note: I know my question is similar to How to import multiple csv files in a single load?, except that I want to include files of all contained folders, independent of their location within the main folder.
Upvotes: 8
Views: 11187
Reputation: 71
You can now use the recursiveFileLookup option in Spark 3:
val recursiveLoadedDF = spark.read
.option("recursiveFileLookup", "true")
.csv("resources/")
For more details, see: recursive-file-lookup
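If you also want to keep the question's options and restrict the lookup to csv files, a sketch could look like this (pathGlobFilter is another Spark 3 data source option; verify the combination with recursiveFileLookup on your Spark version):

val recursiveLoadedDF = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  // walk every subfolder of resources/
  .option("recursiveFileLookup", "true")
  // only pick up files ending in .csv while recursing
  .option("pathGlobFilter", "*.csv")
  .csv("resources/")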
Upvotes: 7
Reputation: 962
If there are only csv files and only one level of subfolders in your resources directory, then you can use the glob resources/**.
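With the folder layout from the question, that would look roughly like this (a sketch; whether ** reaches nested files depends on the Hadoop glob implementation, so verify it on your setup):

val df = spark.read
  .option("header", "true")
  // ** is meant to match first.csv as well as the files one subfolder down
  .csv("resources/**")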
EDIT
Otherwise you can use the Hadoop FileSystem class to recursively list every csv file in your resources directory and then pass the list to .csv():
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.collection.mutable.ListBuffer

val fs = FileSystem.get(new Configuration())
// recursive = true makes listFiles walk every subdirectory of resources/
val files = fs.listFiles(new Path("resources/"), true)
val filePaths = new ListBuffer[String]
while (files.hasNext()) {
  val file = files.next()
  // keep only csv files, in case other file types live in the tree
  if (file.getPath.getName.endsWith(".csv"))
    filePaths += file.getPath.toString
}

val df: DataFrame = spark
  .read
  .options(...)
  .csv(filePaths: _*)
Upvotes: 11