Reputation: 745
We have a blob storage where plenty of files arrive throughout the day. I have a Databricks notebook running in batch that reads the directory listing, loops over the files and sends them all into an Azure SQL DW. That works fine. Afterwards the processed files are moved into an archive. But the process of looping over the file list, appending each file and adding the filename to a column is a bit slow. I was wondering if this could be done in one run. Loading all the CSVs at once can be done, but how do I keep the corresponding filename in a column?
Does anybody have a suggestion?
Upvotes: 1
Views: 2800
Reputation: 143
There are a couple of ways I can think of:
1. spark.read.format("csv").load("path").select(input_file_name()) (needs from pyspark.sql.functions import input_file_name)
2. spark.sparkContext.wholeTextFiles("path").map{case(x,y) => x} <-- Scala; avoid if the data is huge
Both provide all the filenames in the given path. The former is DataFrame-based and is likely to be faster than the latter RDD-based one.
Note: I haven't tested the solution.
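For the original question (loading all the CSVs in one pass and keeping the source filename in a column), a minimal PySpark sketch of the first approach could look like the following; the storage path, header option and column name are placeholders, not taken from the question:

    from pyspark.sql.functions import input_file_name

    # Hypothetical source path -- adjust to your storage account / container
    source_path = "wasbs://container@account.blob.core.windows.net/incoming/*.csv"

    # Read every CSV in a single pass and tag each row with the file it came from
    df = (spark.read
              .format("csv")
              .option("header", "true")
              .load(source_path)
              .withColumn("source_file", input_file_name()))

    # df can now be written to Azure SQL DW in one batch instead of per file

input_file_name() is evaluated per row at read time, so every row carries the full path of the CSV it was read from, which removes the need to loop over the directory listing yourself.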
Upvotes: 5