Reputation: 841
I have a set of input files to process using Pig, with the following naming structure:
/user/hdp/input/custom/Fold1/train0.txt
/user/hdp/input/custom/Fold1/train1.txt
/user/hdp/input/custom/Fold1/train2.txt
/user/hdp/input/custom/Fold1/train3.txt
...
/user/hdp/input/custom/Fold1/train9.txt
/user/hdp/input/custom/Fold1/train10.txt
/user/hdp/input/custom/Fold1/train11.txt
/user/hdp/input/custom/Fold1/train12.txt
...
up to training file 99. I build my Pig script dynamically as a Java String, which I then submit to my cluster. I am looking for a general solution to load the range of train files from 0 up to some number x, where I can set x to any Java int up to 99.
In a previous version of my solution, which supported values of x up to 9, I used Pig's support for globs in the following way:
pigString += "TRAIN = LOAD 'user/hdp/input/custom/Fold1/train[0-"+x+"].txt' USING PigStorage(' ');";
This approach does not scale to values greater than 9, because from 10 onwards the number takes up two characters instead of one. One potential solution would be to split x into its tens digit and its units digit and use these to build the Pig string:
int tens = x/10;
int single = x%10;
if(tens>0)
pigString += "TRAIN = LOAD 'user/hdp/input/custom/Fold1/train[0-"+tens+"][0-"+single+"].txt' USING PigStorage(' ');";
else
pigString += "TRAIN = LOAD 'user/hdp/input/custom/Fold1/train[0-"+single+"].txt' USING PigStorage(' ');";
This solution, however, has two problems.
Does anyone know a generic solution to load my range of data files up to any value of x? I don't know if I'm on the right track using globs, so any non-glob solution would also be very much appreciated.
Many thanks in advance!
Upvotes: 1
Views: 765
Reputation: 979
I looked at the Hadoop glob syntax, and it seems this should be easier to do than we initially thought. Hadoop globs support alternation with braces, so {0,1,2} matches any one of the listed strings.
Create a comma-separated string of all the numbers that you are interested in and call it expectedNumbers, e.g. expectedNumbers = "0,1,2,3,4,5", and then use it as below:
pigString += "TRAIN = LOAD 'user/hdp/input/custom/Fold1/train{" + expectedNumbers + "}.txt' USING PigStorage(' ');";
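To generate the comma-separated list for an arbitrary x instead of typing it by hand, something along these lines could work. This is only a sketch: the class name TrainGlob and the helper names numberList and loadStatement are made up for illustration, and the path is copied as-is from the question's snippet.

```java
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical helper class; the names are illustrative, not from the question.
final class TrainGlob {

    // Builds "0,1,...,x" for use inside a Hadoop {a,b,c} glob alternation.
    static String numberList(int x) {
        if (x < 0) {
            throw new IllegalArgumentException("x must be non-negative");
        }
        return IntStream.rangeClosed(0, x)
                .mapToObj(Integer::toString)
                .collect(Collectors.joining(","));
    }

    // Assembles the Pig LOAD statement; the path is the one used in the question.
    static String loadStatement(int x) {
        return "TRAIN = LOAD 'user/hdp/input/custom/Fold1/train{"
                + numberList(x) + "}.txt' USING PigStorage(' ');";
    }

    public static void main(String[] args) {
        // Prints the LOAD statement covering train0.txt through train12.txt.
        System.out.println(loadStatement(12));
    }
}
```

Because every number is listed explicitly, one-digit and two-digit file names are handled the same way, so x = 9 and x = 10 need no special casing.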
Hope this helps.
Upvotes: 1