Reputation: 961
I have many Parquet files in an S3 bucket. The directory structure varies by vid, something like this:
bucketname/vid=123/year=2020/month=9/date=12/hf1hfw2he.parquet
bucketname/vid=456/year=2020/month=8/date=13/34jbj.parquet
bucketname/vid=876/year=2020/month=9/date=15/ghg76.parquet
I have a list that contains all the vid values, something like this:
vid_list = ['123','456','876']
How can I read all the files for month=9 at once without running into performance issues?
current_month=9
temp_df = sqlContext.read.option("mergeSchema", "false").parquet('s3a://bucketname' + 'vid={}/year=2020/month={}/*/*.parquet'.format(*vid_list,current_month))
This gives me the error Path does not exist: file:/Users/home/desktop/test1/vid=123/year=2020/month=456/*/*.parquet;
Is there an efficient way to achieve this?
Upvotes: 0
Views: 3365
Reputation: 4623
Try the following code:
# Hadoop path globs express alternation as {a,b,c}, not as a regex group (a|b|c)
vid_list = '{' + ','.join(['123','456','876']) + '}'
current_month=9
temp_df = sqlContext.read.option("mergeSchema", "false").parquet('s3://bucketname/' + 'vid={}/year=2020/month={}/*/*.parquet'.format(vid_list,current_month))
# URL should look like: s3://bucketname/vid={123,456,876}/year=2020/month=9/*/*.parquet
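The error in your code: format(*vid_list, current_month) unpacks the list, so the first two vids ('123' and '456') fill the two placeholders and the rest are ignored. The month ends up as 456 when it should be 9, which is the path shown in your error message:
file:/Users/home/desktop/test1/vid=123/year=2020/month=456/*/*.parquet;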
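To make both points concrete, here is a minimal sketch that only builds the path string, so it runs without Spark or S3. The helper name build_parquet_glob is hypothetical, and it assumes the reader resolves Hadoop-style globs, where {a,b,c} means "any of these". The last lines reproduce the unpacking mistake from the question.

```python
def build_parquet_glob(bucket, vids, year, month):
    """Build one glob path covering every vid for a given year and month.

    Hypothetical helper for illustration; assumes Hadoop-style glob
    alternation of the form {a,b,c}.
    """
    # Join the vids into a single alternation group, e.g. {123,456,876}.
    vid_glob = "{" + ",".join(vids) + "}"
    # The braces inside vid_glob are safe here because vid_glob is passed
    # as an argument, not as part of the format template.
    return "s3a://{}/vid={}/year={}/month={}/*/*.parquet".format(
        bucket, vid_glob, year, month
    )


vid_list = ["123", "456", "876"]
current_month = 9

path = build_parquet_glob("bucketname", vid_list, 2020, current_month)
print(path)

# What went wrong in the question: unpacking the list passes four
# positional arguments to a template with two placeholders, so '456'
# lands in the month slot and the remaining arguments are ignored.
buggy = "vid={}/year=2020/month={}".format(*vid_list, current_month)
print(buggy)
```

The resulting string can then be passed to sqlContext.read.parquet(path). If you also want vid, year and month to come back as columns, setting the reader's basePath option to the bucket root is probably needed, since partition discovery starts from the paths you pass in; I have not verified that against your data.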
Upvotes: 1