Asif

Reputation: 733

Load multiple files from multiple folders in spark

I have a data set in which a main folder contains multiple subfolders, and each subfolder contains multiple CSV files. Every CSV file has three columns named X, Y, and Z. I want to create a dataframe whose first three columns are X, Y, and Z, plus two more columns: the fourth containing the name of the folder the CSV file was read from, and the fifth containing the name of the CSV file itself. How can I create this dataframe in Scala and Spark?

Upvotes: 5

Views: 6635

Answers (1)

notNull

Reputation: 31490

You can use spark.read.csv, then input_file_name() to get the full path of each file, and extract the directory name from that path.

Example:

1. Extracting the directory from the file name:

// Let's say we have a directory `tmp2` whose subfolders contain CSV files:
tmp2
|-folder1
|-folder2

// extracting the directory name from the file path
import org.apache.spark.sql.functions._

spark.read.option("header",true).
  csv("tmp2/*").
  withColumn("file_name",input_file_name()).
  withColumn("directory",element_at(reverse(split(col("file_name"),"/")),2)).
  show()

//+----+---+---------------------------+---------+
//|name|id |file_name                  |directory|
//+----+---+---------------------------+---------+
//|2   |b  |file:///tmp2/folder2/t1.csv|folder2  |
//|1   |a  |file:///tmp2/folder1/t.csv |folder1  |
//+----+---+---------------------------+---------+
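
If you also need the bare CSV file name as its own column (which the question asks for), the same split/reverse trick works with index 1 instead of 2. A minimal sketch, assuming the same `tmp2` layout as above; the column names `folder` and `csv_file` are illustrative:

import org.apache.spark.sql.functions._

// assumed path layout: .../<folder>/<file>.csv
// after reverse(split(...)): index 1 = file name, index 2 = folder name
spark.read.option("header",true).
  csv("tmp2/*").
  withColumn("path",input_file_name()).
  withColumn("folder",element_at(reverse(split(col("path"),"/")),2)).
  withColumn("csv_file",element_at(reverse(split(col("path"),"/")),1)).
  drop("path").
  show(false)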

2. Getting the folder name while reading the file:

If your folder structure follows the folder=<val> convention, Spark treats folder as a partition column and adds it to the dataframe automatically.

//folder structure

tmp3
|-folder=1
|-folder=2

spark.read.
  option("header",true).
  csv("tmp3").
  withColumn("file_name",input_file_name()).
  show(false)

//+----+---+------+---------------------------+
//|name|id |folder|file_name                  |
//+----+---+------+---------------------------+
//|a   |1  |2     |file:///tmp3/folder=2/t.txt|
//|a   |1  |1     |file:///tmp3/folder=1/t.txt|
//+----+---+------+---------------------------+
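
Putting this together for the layout in the question, here is a sketch that yields all five requested columns; main_folder and the added column names are hypothetical stand-ins for your actual paths:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// hypothetical layout: main_folder/<subfolder>/*.csv, each file with columns X,Y,Z
val schema = StructType(Seq(
  StructField("X", StringType),
  StructField("Y", StringType),
  StructField("Z", StringType)
))

val df = spark.read.option("header",true).schema(schema).
  csv("main_folder/*/*.csv").
  withColumn("path",input_file_name()).
  withColumn("folder_name",element_at(reverse(split(col("path"),"/")),2)).
  withColumn("file_name",element_at(reverse(split(col("path"),"/")),1)).
  drop("path")

df.show(false)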

Upvotes: 9
