Reputation: 15
I have to read .parquet files that are stored in multiple folders, one per year. This is not a problem for one or two years, but it becomes tedious beyond that, since each year contains 12 monthly subdirectories that I must read one by one. Here is an example of my current, inefficient approach:
df_2019_01=spark.read.parquet('/2019/01/name.parquet/')
df_2019_02=spark.read.parquet('/2019/02/name.parquet/')
df_2019_03=spark.read.parquet('/2019/03/name.parquet/')
df_2019_04=spark.read.parquet('/2019/04/name.parquet/')
#...
df_2019_12=spark.read.parquet('/2019/12/name.parquet/')
df_2020_01=spark.read.parquet('/2020/01/name.parquet/')
df_2020_02=spark.read.parquet('/2020/02/name.parquet/')
df_2020_03=spark.read.parquet('/2020/03/name.parquet/')
df_2020_04=spark.read.parquet('/2020/04/name.parquet/')
#...
df_2020_12=spark.read.parquet('/2020/12/name.parquet/')
df = (df_2019_01.union(df_2019_02).union(df_2019_03).union(df_2019_04)
      # ... and so on for every remaining month ...
      .union(df_2020_12))
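The 24 assignments above can also be generated with a loop, since `spark.read.parquet` accepts several paths in one call and unions them itself. A minimal sketch, assuming the `/year/month/name.parquet/` layout from the question (the Spark call is left commented because it needs a running session):

```python
# Build the 24 monthly paths instead of one variable per month.
years = [2019, 2020]
paths = [f"/{y}/{m:02d}/name.parquet/" for y in years for m in range(1, 13)]

# spark.read.parquet takes multiple paths and reads them as one DataFrame:
# df = spark.read.parquet(*paths)
```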
Upvotes: 0
Views: 713
Reputation: 4059
Replace the year and month with *:
df = spark.read.parquet('/*/*/*.parquet')
All Parquet files must share the same schema; otherwise the final DataFrame may end up with missing columns. If the schemas differ across months, you can enable schema merging:
mergedDF = spark.read.option("mergeSchema", "true").parquet('/*/*/*.parquet')
Your problem is similar to this question; if you want to retrieve the year and month as well, just follow my answer there.
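One way to recover the year and month is to parse them out of each row's file path. A sketch under the assumption that the paths follow the /year/month/ layout from the question; the Spark lines are illustrative only and use `input_file_name` plus `regexp_extract`:

```python
import re

# Pattern that pulls year and month from paths like
# '/2019/01/name.parquet/part-00000.parquet'
PATH_RE = r"/(\d{4})/(\d{2})/"

m = re.search(PATH_RE, "/2019/01/name.parquet/part-00000.parquet")
year, month = m.group(1), m.group(2)

# The same pattern applied on the Spark side (hypothetical sketch):
# from pyspark.sql import functions as F
# df = (spark.read.parquet('/*/*/*.parquet')
#         .withColumn("year",  F.regexp_extract(F.input_file_name(), PATH_RE, 1))
#         .withColumn("month", F.regexp_extract(F.input_file_name(), PATH_RE, 2)))
```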
Upvotes: 1