安安朱

Reputation: 33

how should I express the hdfs path in spark textfile?

I want to load data like path :

hdfs://dcoshdfs/encrypt_data/gmap_info/2019-02-04/*/*
hdfs://dcoshdfs/encrypt_data/gmap_info/2019-02-05/*/*
hdfs://dcoshdfs/encrypt_data/gmap_info/2019-02-06/*/*
hdfs://dcoshdfs/encrypt_data/gmap_info/2019-02-07/*/*
...
hdfs://dcoshdfs/encrypt_data/gmap_info/2019-02-14/*/*

this is my code

val data = sc.textFile("hdfs://dcoshdfs/encrypt_data/gmap_info/2019-02-"+"1[0-3]".r+"/*/*")

and

val data = sc.textFile("hdfs://dcoshdfs/encrypt_data/gmap_info/2019-02-"+"0[4-9]".r+"/*/*")

Either of these is OK, but

val data = sc.textFile("hdfs://dcoshdfs/encrypt_data/gmap_info/2019-02-"+"0[0-9]|1[0-4]".r+"/*/*")

doesn't work.

How should I write the path pattern to load all the data from 04 to 13?

Upvotes: 3

Views: 59

Answers (2)

pheeleeppoo

Reputation: 1525

Try the Hadoop glob syntax for alternation:

  • {a,b} instead of (a|b)

So in your case, loading the text files would look like this:

val data = sc.textFile("hdfs://dcoshdfs/encrypt_data/gmap_info/2019-02-{0[4-9],1[0-3]}/*/*")

to load all files from 2019-02-04 to 2019-02-13 subdirectories.
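If the glob gets hard to read, note that `sc.textFile` also accepts a comma-separated list of paths, so you can build the list programmatically. A minimal sketch (the base path is taken from the question; only the string construction is shown):

```scala
// Build "…/2019-02-04/*/*,…/2019-02-05/*/*,…" for days 04 through 13.
val base = "hdfs://dcoshdfs/encrypt_data/gmap_info"
val paths = (4 to 13).map(d => f"$base/2019-02-$d%02d/*/*").mkString(",")
// val data = sc.textFile(paths)  // textFile accepts a comma-separated path list
println(paths)
```

This avoids glob edge cases entirely, at the cost of a slightly longer path string.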

Upvotes: 1

RefiPeretz

Reputation: 563

This is not exactly an answer, more a best practice suggestion: if you control the path layout, save your data with date partitions:

hdfs://dcoshdfs/encrypt_data/gmap_info/date=20190519
hdfs://dcoshdfs/encrypt_data/gmap_info/date=20190418
.
.
.
hdfs://dcoshdfs/encrypt_data/gmap_info/date=20160101

Then you can simply extract whatever you want using Spark (a DataFrame read is needed here, since `sc.textFile` returns an RDD with no `where` method or partition columns):

val data = spark.read.text("hdfs://dcoshdfs/encrypt_data/gmap_info").where('date >= 20190204L && 'date <= 20190213L)

This is the most efficient approach, since partition pruning lets Spark load exactly the data it needs instead of listing and reading every subdirectory; plus it is much more readable.
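To see why the filter maps directly onto directories: Hive-style partition paths encode the column value in the directory name, so the predicate above selects whole directories. A plain-Scala sketch of that pruning logic (directory names are illustrative):

```scala
// Hive-style partition directories encode the "date" column in their names.
// The where-clause predicate selects only the matching directories.
val dirs = Seq("date=20190203", "date=20190204", "date=20190213", "date=20190214")
val selected = dirs
  .map(_.stripPrefix("date=").toLong)   // recover the partition value
  .filter(d => d >= 20190204L && d <= 20190213L)
println(selected)  // only 20190204 and 20190213 fall inside the range
```

Spark performs this directory-level filtering before reading any file contents, which is where the speedup comes from.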

Upvotes: 1
