Reputation: 67
I have a use case where I need to read and process all the files from a directory and create a separate output file for each of them after applying some transformations in Spark. I want to apply the required transformations to the files currently present in the landing directory in parallel.
Below is the sample code that I tried, but it is not working.
def fileList = .... //to fetch file names from a directory
def businessLogic() //where I am doing all the operations. (read from a file, transformation, etc)
fileList().map(businessLogic) // calling business logic in parallel
Could you please let me know how I can achieve parallel processing?
Note: The number of files can be N. Reading all the files into a single DataFrame is not an option, as I have to create an output file for each input file; triggering multiple Spark jobs is also not an option.
Thanks, Sourav
Upvotes: 0
Views: 1027
Reputation: 2821
Here's an example which is basically what Srinivas has suggested in the comments.
The key here is the function input_file_name, which provides the original file name.
Note that if fileList is a standard Scala collection rather than a distributed data structure (such as a DataFrame/Dataset/RDD), then operations such as foreach and map are not executed in parallel. If you would like to use native Scala to achieve parallel execution, you can look into Futures (see the sketch after the example below).
// spark is a SparkSession
import org.apache.spark.sql.functions.{input_file_name, substring_index}

// Read all the files in the directory into a single DataFrame
val df = spark.read.option("header", "true").option("inferSchema", "true").csv("path/to/data")
// Keep only the file name (the part after the last "/") of the source file for each row
val df2 = df.withColumn("input", substring_index(input_file_name(), "/", -1))
// Partitioning the write by "input" produces one output directory per source file
df2.write.partitionBy("input").option("header", "true").csv("output")
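As for the Futures mentioned above, here is a minimal sketch of that approach. It assumes fileList is a plain Scala collection of file paths and reuses the question's businessLogic placeholder, here taken to accept a file name and to read, transform, and write that single file:
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Launch one Future per file; each Future submits its own Spark actions,
// so several per-file jobs can run on the cluster concurrently
val jobs = fileList.map { name =>
  Future { businessLogic(name) } // hypothetical per-file read/transform/write
}

// Block until every file has been processed
Await.result(Future.sequence(jobs), Duration.Inf)
Keep in mind that the default global execution context is sized to the number of available cores, so it caps how many Futures run at once; use a dedicated ExecutionContext if you need a different level of concurrency.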
Upvotes: 1
Reputation: 40500
Something like this, maybe?
import org.apache.spark.sql.functions.{col, lit}

// Read each file, tag its rows with the file name, and union everything into one DataFrame
val combinedDF = fileList.map { name =>
  spark.read.whatever(name).withColumn("file_name", lit(name))
}.reduce { _ union _ }

// Apply the transformations once, group the rows by file, and cache the result
val result = applyBusinessLogic(combinedDF).repartition(col("file_name")).persist
// Write one output per original file; persist avoids recomputing the transformation each time
fileList.foreach { name =>
  createOutput(name, result.filter(col("file_name") === name))
}
Upvotes: 1