Alejandro Montenegro

Reputation: 15

How do I read multiple files from multiple folders in Python?

I have to read a '.parquet' file that exists in multiple folders, one per year. This is not a problem with one or two years, but it becomes cumbersome beyond that, since for each year I must read the 12 subdirectories corresponding to each month. Below is an example of how I do it in an inefficient way.

Step 1: Read files

YEAR 2019

df_2019_01=spark.read.parquet('/2019/01/name.parquet/')
df_2019_02=spark.read.parquet('/2019/02/name.parquet/')
df_2019_03=spark.read.parquet('/2019/03/name.parquet/')
df_2019_04=spark.read.parquet('/2019/04/name.parquet/')

#...

df_2019_12=spark.read.parquet('/2019/12/name.parquet/')

YEAR 2020

df_2020_01=spark.read.parquet('/2020/01/name.parquet/')
df_2020_02=spark.read.parquet('/2020/02/name.parquet/')
df_2020_03=spark.read.parquet('/2020/03/name.parquet/')
df_2020_04=spark.read.parquet('/2020/04/name.parquet/')

#...

df_2020_12=spark.read.parquet('/2020/12/name.parquet/')

Step 2: Union files (every month of every year). NOTE: 1) all files have the same structure; 2) the file name is the same in all folders.

df = df_2019_01.union(df_2019_02).union(df_2019_03).union(df_2019_04).union(df_2020_12)
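Rather than assigning one dataframe per month, the monthly reads above can be generated in a loop, and `spark.read.parquet` accepts several paths in a single call. A minimal sketch, assuming the `/YYYY/MM/name.parquet/` layout from the question (the `monthly_paths` helper name is hypothetical):

```python
def monthly_paths(years, base=""):
    # Build one path per (year, month) pair, e.g. "/2019/01/name.parquet/"
    return [f"{base}/{y}/{m:02d}/name.parquet/" for y in years for m in range(1, 13)]

paths = monthly_paths([2019, 2020])

# spark.read.parquet accepts multiple paths, so one read replaces the
# chained unions (all files share the same schema, per the question's note):
# df = spark.read.parquet(*paths)
```

This keeps the month/year logic in one place, so adding a year means extending the list rather than writing twelve more read lines.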

Upvotes: 0

Views: 713

Answers (1)

Kafels

Reputation: 4059

Replace the year and month with *:

df = spark.read.parquet('/*/*/*.parquet')
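Spark resolves these wildcards against the directory tree, with each `*` standing in for one path segment. As a rough illustration of which paths the pattern picks up — using Python's `fnmatch` as a stand-in (its globbing rules are not identical to Hadoop's) and example paths modeled on the question:

```python
from fnmatch import fnmatch

candidates = [
    "/2019/01/name.parquet",
    "/2019/12/name.parquet",
    "/2020/07/name.parquet",
    "/2020/notes.txt",
]
pattern = "/*/*/*.parquet"

# Only the parquet directories under the year/month folders match;
# the stray text file is filtered out by the ".parquet" suffix.
matched = [p for p in candidates if fnmatch(p, pattern)]
```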

All parquet files must have the same schema; otherwise, your final dataframe will end up with missing columns. You can try this option:

mergedDF = spark.read.option("mergeSchema", "true").parquet('/*/*/*.parquet')

Your problem is similar to this question; if you want to retrieve the year and month, just follow my answer there
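One way to recover the year and month after a wildcard read is to parse them out of the file path; in Spark this is typically done with `F.input_file_name()` plus `F.regexp_extract`. A minimal sketch of the regex itself (the sample path is hypothetical, modeled on the question's layout):

```python
import re

# Hypothetical full path of one part file under the question's layout
path = "/2019/01/name.parquet/part-00000.snappy.parquet"

# Capture the 4-digit year and 2-digit month segments from the path
m = re.search(r"/(\d{4})/(\d{2})/", path)
year, month = m.group(1), m.group(2)

# The same pattern can be applied inside Spark, e.g.:
# from pyspark.sql import functions as F
# df = (spark.read.parquet('/*/*/*.parquet')
#           .withColumn("year",  F.regexp_extract(F.input_file_name(), r"/(\d{4})/(\d{2})/", 1))
#           .withColumn("month", F.regexp_extract(F.input_file_name(), r"/(\d{4})/(\d{2})/", 2)))
```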

Upvotes: 1
