Reputation: 33
I want to load data like path :
hdfs://dcoshdfs/encrypt_data/gmap_info/2019-02-04/*/*
hdfs://dcoshdfs/encrypt_data/gmap_info/2019-02-05/*/*
hdfs://dcoshdfs/encrypt_data/gmap_info/2019-02-06/*/*
hdfs://dcoshdfs/encrypt_data/gmap_info/2019-02-07/*/*
...
hdfs://dcoshdfs/encrypt_data/gmap_info/2019-02-14/*/*
This is my code:
val data = sc.textFile("hdfs://dcoshdfs/encrypt_data/gmap_info/2019-02-"+"1[0-3]".r+"/*/*")
and
val data = sc.textFile("hdfs://dcoshdfs/encrypt_data/gmap_info/2019-02-"+"0[4-9]".r+"/*/*")
Either one works, but
val data = sc.textFile("hdfs://dcoshdfs/encrypt_data/gmap_info/2019-02-"+"0[0-9]|1[0-4]".r+"/*/*")
doesn't work.
How should I write the path pattern to load all the data from 04 to 13?
Upvotes: 3
Views: 59
Reputation: 1525
Try the glob syntax for alternation:
{a,b}
instead of the regex-style (a|b). The path argument is interpreted as a Hadoop glob pattern, not a regular expression.
So in your case the load of the text file would be like the following:
val data = sc.textFile("hdfs://dcoshdfs/encrypt_data/gmap_info/2019-02-{0[4-9],1[0-3]}/*/*")
to load all files from the 2019-02-04 to 2019-02-13 subdirectories.
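As a side note, `sc.textFile` also accepts a comma-separated list of paths, so the day range can be generated programmatically instead of encoded as a glob. A minimal sketch, assuming a `SparkContext` named `sc` is in scope:

```scala
// Build the 2019-02-04 .. 2019-02-13 paths explicitly and pass them as one
// comma-separated string; sc.textFile splits on commas.
val days = (4 to 13).map(d => f"$d%02d") // "04", "05", ..., "13"
val paths = days
  .map(d => s"hdfs://dcoshdfs/encrypt_data/gmap_info/2019-02-$d/*/*")
  .mkString(",")
val data = sc.textFile(paths)
```

This avoids glob edge cases entirely when the date boundaries don't align neatly with a bracket expression.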
Upvotes: 1
Reputation: 563
This is not exactly an answer, more of a best practice suggestion: if you are able to control the path layout, save your data with date partitions:
hdfs://dcoshdfs/encrypt_data/gmap_info/date=20190519
hdfs://dcoshdfs/encrypt_data/gmap_info/date=20190418
.
.
.
hdfs://dcoshdfs/encrypt_data/gmap_info/date=20160101
Then you can simply extract whatever you want using Spark:
val data = spark.read.text("hdfs://dcoshdfs/encrypt_data/gmap_info").where('date >= 20190204L && 'date <= 20190213L)
This is the most optimized approach, since partition pruning lets Spark read exactly the data it needs; plus it is much more readable.
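For completeness, a sketch of how such a layout might be produced on the write side. This assumes a DataFrame `df` with a numeric `date` column and a single string value column (the name `df` and the output path are illustrative):

```scala
// partitionBy("date") writes one date=<value> subdirectory per distinct
// date, which is exactly the layout Spark's partition discovery expects
// and can prune on read.
df.write
  .partitionBy("date")
  .text("hdfs://dcoshdfs/encrypt_data/gmap_info")
```

Note that the text writer requires exactly one non-partition string column; for richer schemas, a columnar format such as Parquet is the more common choice.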
Upvotes: 1