Cdr

Reputation: 581

pyspark read multiple csv files at once

I'm using Spark to read files in HDFS. There is a scenario where we receive files in chunks from a legacy system in CSV format:

ID1_FILENAMEA_1.csv
ID1_FILENAMEA_2.csv
ID1_FILENAMEA_3.csv
ID1_FILENAMEA_4.csv
ID2_FILENAMEA_1.csv
ID2_FILENAMEA_2.csv
ID2_FILENAMEA_3.csv

These files are loaded into the FILENAMEA table in Hive using the Hive Warehouse Connector, with a few transformations such as adding default values. Similarly, we have around 70 tables. The Hive tables are created in ORC format and partitioned on ID. Right now I'm processing all of these files one by one, and it's taking a long time.

I want to make this process much faster. The files will be several GB in size.

Is there any way to read all the FILENAMEA files at the same time and load them into the Hive tables?
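
For reference, here is roughly what I'm doing today (simplified; the paths, the default-value transformation, and the table name are illustrative, and the write goes through the Hive Warehouse Connector data source):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("load-filenamea").getOrCreate()

# One file at a time -- this is the slow part.
for path in ["hdfs:///landing/ID1_FILENAMEA_1.csv",
             "hdfs:///landing/ID1_FILENAMEA_2.csv"]:
    df = spark.read.csv(path, header=True)
    # Example transformation: fill in a default value.
    df = df.withColumn("status", F.coalesce(F.col("status"), F.lit("UNKNOWN")))
    # Write through the Hive Warehouse Connector data source.
    df.write \
      .format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector") \
      .mode("append") \
      .option("table", "FILENAMEA") \
      .save()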

Upvotes: 18

Views: 55970

Answers (2)

AEChris

Reputation: 81

Using spark.read.csv(["path1","path2","path3"...]) you can read multiple files from different paths. But that means you first have to build a list of the paths: an actual Python list, not a string of comma-separated file paths.
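
For example (the paths are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

paths = [
    "hdfs:///landing/ID1_FILENAMEA_1.csv",
    "hdfs:///landing/ID1_FILENAMEA_2.csv",
    "hdfs:///landing/ID2_FILENAMEA_1.csv",
]
df = spark.read.csv(paths, header=True)  # a list, not "path1,path2,path3"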

Upvotes: 5

Vincent Doba

Reputation: 5068

You have two methods to read several CSV files in PySpark. If all the CSV files are in the same directory and share the same schema, you can read them all at once by passing the directory path directly as the argument, as follows:

spark.read.csv('hdfs://path/to/directory')
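
For example, a slightly fuller sketch (the directory path and the options are assumptions):

df = spark.read.csv(
    "hdfs://path/to/directory",  # hypothetical directory containing only the CSV chunks
    header=True,                 # treat the first line of each file as a header
    inferSchema=True,            # sample the data to guess column types
)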

If the CSV files are in different locations, or in the same directory but mixed with other CSV/text files you don't want, you can pass them as a single string of comma-separated paths to the .csv() method, as follows:

spark.read.csv('hdfs://path/to/filename1,hdfs://path/to/filename2')
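
Since the chunk files in the question share a naming pattern, a glob pattern (supported by Spark's file-based sources) can also select just those files; the path below is an assumption:

spark.read.csv('hdfs://path/to/directory/*_FILENAMEA_*.csv')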

You can find more information about how to read a CSV file with Spark here.

If you need to build this list of paths from the files in an HDFS directory, you can look at this answer. Once you've created your list of paths, you can turn it into a single string to pass to the .csv() method with ','.join(your_file_list).
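
A sketch of that approach, using the Hadoop FileSystem API through the JVM gateway (the directory and glob pattern are assumptions):

# Assumes an existing SparkSession named `spark`.
jvm = spark.sparkContext._jvm
jsc = spark.sparkContext._jsc

# List all matching files in the HDFS directory.
fs = jvm.org.apache.hadoop.fs.FileSystem.get(jsc.hadoopConfiguration())
pattern = jvm.org.apache.hadoop.fs.Path("hdfs://path/to/directory/*_FILENAMEA_*.csv")
file_list = [str(status.getPath()) for status in fs.globStatus(pattern)]

# Join the paths into one comma-separated string for .csv().
df = spark.read.csv(','.join(file_list))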

Upvotes: 26
