mmnormyle

Reputation: 883

Spark read multiple CSV files, one partition for each file

Suppose I have multiple CSV files in the same directory, and these files all share the same schema.

/tmp/data/myfile1.csv, /tmp/data/myfile2.csv, /tmp/data/myfile3.csv, /tmp/data/myfile4.csv

I would like to read these files into a Spark DataFrame or RDD, and I would like each file to be a partition of the DataFrame. How can I do this?

Upvotes: 6

Views: 2210

Answers (1)

Ryan Widmaier

Reputation: 8513

You have two options I can think of:

1) Use the Input File name

Instead of trying to control the partitioning directly, add the name of the input file to your DataFrame and use that for any grouping/aggregation operations you need to do. This is probably your best option, as it is more aligned with Spark's parallel-processing intent, where you tell it what to do and let it figure out how. You can do that with code like this:

SQL:

SELECT input_file_name() as fname FROM dataframe

Or Python:

from pyspark.sql.functions import input_file_name

newDf = df.withColumn("filename", input_file_name())
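For example, here is a minimal sketch of the grouping approach, assuming the four files above live under /tmp/data/ and have headers (the app name and the row-count aggregation are just illustrative choices):

from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name, count

spark = SparkSession.builder.appName("per-file-grouping").getOrCreate()

# Read every CSV in the directory; Spark decides the physical partitioning
df = spark.read.csv("/tmp/data/*.csv", header=True, inferSchema=True)

# Tag each row with its source file, then aggregate per file
per_file = (df
            .withColumn("filename", input_file_name())
            .groupBy("filename")
            .agg(count("*").alias("row_count")))

per_file.show(truncate=False)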

2) Gzip your CSV files

Gzip is not a splittable compression format, so Spark cannot split a gzipped file across multiple tasks. This means that when loading gzipped files, each file will be its own partition.
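As a sketch of checking this, assuming the files have been gzipped in place (the .csv.gz paths and app name here are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gzip-partitions").getOrCreate()

# Spark decompresses .gz inputs transparently; each one maps to a single partition
df = spark.read.csv("/tmp/data/*.csv.gz", header=True, inferSchema=True)

# With four gzipped inputs this should print 4: one partition per file
print(df.rdd.getNumPartitions())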

Upvotes: 3
