Harry Leboeuf

Reputation: 745

Spark load csv files and memorise filename in column

We have a blob storage where plenty of files arrive throughout the day. I have a Databricks notebook running in batch that reads the directory listing, loops over the files, and sends them all into an Azure SQL DW. It works fine, and afterwards the processed files are moved into an archive. But the process of looping over the file list, appending each file, and adding the filename to a column is a bit slow. I was wondering whether this could be done in one run. Loading all the CSVs at once is possible, but how do I keep the corresponding filename for each row in a column?

Anybody has a suggestion ?

Upvotes: 1

Views: 2800

Answers (1)

Girish501

Reputation: 143

There are a couple of ways I can think of:

1. spark.read.format("csv").load("path").select(input_file_name())

2. spark.sparkContext.wholeTextFiles("path").map{case(x,y) => x} <-- avoid if data is huge

Both provide all filenames in the given path. The former is DataFrame-based and will likely be faster than the latter RDD-based one.

Note: I haven't tested these solutions.

Upvotes: 5

Related Questions