Reputation: 745
We have a blob storage where plenty of files arrive throughout the day. I have a Databricks notebook running in batch that reads the directory listing, loops over the files and sends them all into an Azure SQL DW. That works fine. Afterwards the processed files are moved into an archive. But the process of looping over the file list, appending each file and adding the filename to a column is a bit slow. I was wondering if this could be done in one run. Loading all the CSVs at once can be done, but how do I keep the corresponding filename in a column?
Does anybody have a suggestion?
Upvotes: 1
Views: 2800
Reputation: 143
There are a couple of ways I can think of:
1. spark.read.format("csv").load("path").select(input_file_name()) (needs from pyspark.sql.functions import input_file_name)
2. spark.sparkContext.wholeTextFiles("path").map{case(x,y) => x} <-- Scala; avoid if the data is huge
Both provide all the filenames in the given path. The former is DataFrame-based and is likely to be faster than the latter RDD-based one.
Note: I haven't tested the solution.
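For the original question (loading all the CSVs in one pass and keeping the source filename in a column), a minimal PySpark sketch of the first approach could look like the following; the storage path, header option and column name are placeholders, not taken from the question:

    from pyspark.sql.functions import input_file_name

    # Hypothetical source path -- adjust to your storage account / container
    source_path = "wasbs://container@account.blob.core.windows.net/incoming/*.csv"

    # Read every CSV in a single pass and tag each row with the file it came from
    df = (spark.read
              .format("csv")
              .option("header", "true")
              .load(source_path)
              .withColumn("source_file", input_file_name()))

    # df can now be written to Azure SQL DW in one batch instead of per file

input_file_name() is evaluated per row at read time, so every row carries the full path of the CSV it was read from, which removes the need to loop over the directory listing yourself.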
Upvotes: 5