Reading Parquet files with Spark using a wildcard

Question

I have many Parquet files in an S3 directory. The directory structure varies by vid, something like this:

bucketname/vid=123/year=2020/month=9/date=12/hf1hfw2he.parquet
bucketname/vid=456/year=2020/month=8/date=13/34jbj.parquet
bucketname/vid=876/year=2020/month=9/date=15/ghg76.parquet

I have a list that contains all the vid values, something like this:

vid_list = ['123','456','876']

How can I read all the files for month=9 at once without a significant performance hit?

current_month=9
temp_df = sqlContext.read.option("mergeSchema", "false").parquet('s3a://bucketname' + 'vid={}/year=2020/month={}/*/*.parquet'.format(*vid_list,current_month))

This gives me the error: Path does not exist: file:/Users/home/desktop/test1/vid=123/year=2020/month=456/*/*.parquet;. Is there an efficient way to achieve this?

Pawan B · Accepted Answer

Try the following code:

# Hadoop glob patterns express alternation as {a,b,c}
vid_glob = '{' + ','.join(['123','456','876']) + '}'
current_month = 9
temp_df = sqlContext.read.option("mergeSchema", "false").parquet('s3://bucketname/' + 'vid={}/year=2020/month={}/*/*.parquet'.format(vid_glob, current_month))
# Path should look like: s3://bucketname/vid={123,456,876}/year=2020/month=9/*/*.parquet
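
As a rough sketch, the path can be built and inspected on its own before it is handed to Spark. The helper names build_month_glob and build_month_paths are hypothetical, the s3a scheme is taken from the question, and the sketch assumes the path is resolved with Hadoop-style globbing, where alternation is written {a,b,c}.

```python
def build_month_glob(bucket, vid_list, year, month):
    """Return a single glob path covering every vid for one year and month."""
    # Assumption: Hadoop-style globbing, where {a,b,c} means "a or b or c".
    vid_glob = "{" + ",".join(vid_list) + "}"
    return "s3a://{}/vid={}/year={}/month={}/*/*.parquet".format(
        bucket, vid_glob, year, month
    )


def build_month_paths(bucket, vid_list, year, month):
    """Return one glob path per vid, as an alternative to brace alternation."""
    return [
        "s3a://{}/vid={}/year={}/month={}/*/*.parquet".format(bucket, vid, year, month)
        for vid in vid_list
    ]


vid_list = ["123", "456", "876"]
glob_path = build_month_glob("bucketname", vid_list, 2020, 9)
paths = build_month_paths("bucketname", vid_list, 2020, 9)
print(glob_path)

# With a live Spark session (not created here), either form could be passed on:
# temp_df = sqlContext.read.option("mergeSchema", "false").parquet(glob_path)
# temp_df = sqlContext.read.option("mergeSchema", "false").parquet(*paths)
```

Both forms describe the same set of files. The list form avoids relying on alternation syntax at all, since parquet() accepts several paths.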

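Error in your code: the month value comes out as 456 when it should be 9. format(*vid_list, current_month) unpacks the list into separate arguments, so the two placeholders receive '123' and '456':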

file:/Users/home/desktop/test1/vid=123/year=2020/month=456/*/*.parquet;
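
The wrong month can be reproduced without Spark. This small sketch uses only the question's values and plain str.format; the variable names template and broken are illustrative.

```python
vid_list = ['123', '456', '876']
current_month = 9
template = 'vid={}/year=2020/month={}/*/*.parquet'

# Unpacking passes four positional arguments: '123', '456', '876', 9.
# The two placeholders consume the first two; the extras are silently ignored.
broken = template.format(*vid_list, current_month)
print(broken)  # vid=123/year=2020/month=456/*/*.parquet
```

Note also that the question concatenates 's3a://bucketname' and 'vid=...' without a '/' between them, so the separator has to be added as well.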
