Reputation: 1912
I have a client that places CSV files in nested directories, as shown below, and I need to read these files in near real-time. I am trying to do this using Spark Structured Streaming.
Data:
/user/data/1.csv
/user/data/2.csv
/user/data/3.csv
/user/data/sub1/1_1.csv
/user/data/sub1/1_2.csv
/user/data/sub1/sub2/2_1.csv
/user/data/sub1/sub2/2_2.csv
Code:
val csvDF = spark
  .readStream
  .option("sep", ",")
  .schema(userSchema) // Schema of the CSV files
  .csv("/user/data/")
Are there any configurations to be added to allow Spark to read from nested directories in Structured Streaming?
Upvotes: 3
Views: 3308
Reputation: 1912
I was able to stream the files in sub-directories using a glob path. Posting here for the sake of others.
inputPath = "/spark_structured_input/*?*"
inputDF = spark.readStream.option("header", "true").schema(userSchema).csv(inputPath)
query = inputDF.writeStream.format("console").start()
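As an aside, the reason `*?*` picks things up is that it matches any path component with at least one character, so files and directories at that level all qualify. A minimal sketch with Python's `fnmatch` (whose `*`/`?` semantics line up with Hadoop globs for simple patterns like this — an assumption made here purely for illustration):

```python
from fnmatch import fnmatch

# "*?*" means: zero+ chars, then exactly one char, then zero+ chars,
# i.e. any non-empty name -- files and directories alike.
entries = ["1.csv", "2.csv", "sub1"]
matched = [e for e in entries if fnmatch(e, "*?*")]
print(matched)  # every non-empty entry matches
```

So the pattern is effectively "match everything at this level"; it does not by itself recurse to arbitrary depth.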
Upvotes: 3
Reputation: 1233
As far as I know, Spark has no such option, but it does support globs in the path.
val csvDF = spark
  .readStream
  .option("sep", ",")
  .schema(userSchema) // Schema of the CSV files
  .csv("/user/data/*/*")
It may help to design a glob path that covers your directory layout and use it in a single stream. Hope it helps!
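Note that each glob pins a specific depth. A quick local sketch (plain Python `glob`, used only to illustrate the matching, not Spark itself) built on the directory layout from the question:

```python
import glob
import os
import tempfile

# Recreate the layout from the question under a temporary root.
root = tempfile.mkdtemp()
for rel in ["1.csv", "2.csv", "3.csv",
            "sub1/1_1.csv", "sub1/1_2.csv",
            "sub1/sub2/2_1.csv", "sub1/sub2/2_2.csv"]:
    path = os.path.join(root, rel)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    open(path, "w").close()

# "*/*.csv" matches only files exactly one level below the root:
# sub1/1_1.csv and sub1/1_2.csv, not the top-level files or sub1/sub2/*.
one_level = glob.glob(os.path.join(root, "*/*.csv"))
two_level = glob.glob(os.path.join(root, "*/*/*.csv"))
print(len(one_level), len(two_level))  # 2 2
```

So `/user/data/*/*` alone would miss both the top-level files and anything deeper than one sub-directory; covering all levels means crafting a pattern per depth (to my knowledge, Hadoop-style globs have no recursive `**`).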
Upvotes: 1