Reputation: 11
I have noticed that the number of mappers in a Pig job doubles when I introduce a SPLIT and a COGROUP statement after loading. Is this expected? Does anyone know why it happens?
I load a dataset using PigStorage:
A = LOAD 'test.csv' USING PigStorage();
cat test.csv
A 123
A 345
B 234
B 123
I then split the dataset into two relations using SPLIT (the result is the same with a FILTER), cogroup the two relations back into one, and store the result:
SPLIT A INTO AA IF $0 == 'A', AB IF $0 == 'B';
CG = COGROUP AA BY $1, AB BY $1;
STORE CG INTO 'cg' USING PigStorage();
When I do that, I see the following lines in my (local) output:
Success!
Job Stats (time in seconds):
JobId Alias Feature Outputs
job_local_0001 A,AA,AB,CG COGROUP /test/cg,
Input(s):
Successfully read records from: "/test/test.csv"
Successfully read records from: "/test/test.csv"
Output(s):
Successfully stored records in: "/test/cg"
So it looks like the data is read twice. Indeed, on a cluster I can see the number of mappers double.
What causes this behaviour? Is there a way to avoid it, or does it have a good reason I am missing?
Upvotes: 1
Views: 286
Reputation: 9
This will avoid reading the file twice:
A = LOAD 'test.csv' USING PigStorage(',');
B = GROUP A BY $1;
C = FOREACH B {
    AA = FILTER A BY $0 == 'A';
    BB = FILTER A BY $0 == 'B';
    GENERATE FLATTEN(group), AA, BB;
};
DUMP C;
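Here the single GROUP replaces the SPLIT/COGROUP pair, so the input is scanned only once; the two FILTERs then run inside the nested FOREACH, on each group's bag, instead of on the raw input. Note that this snippet assumes a comma-delimited file, while the sample data above looks whitespace-separated, so the delimiter passed to PigStorage may need adjusting.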
Upvotes: 0
Reputation: 20826
It depends on how you use AA and AB, and on how you run the script.

If you write

dump AA; dump AB;

in the script or at the grunt prompt, there will be two jobs.

If you write

store AA into '...'; store AB into '...';

at the grunt prompt, there will also be two jobs. However, if you put

store AA into '...'; store AB into '...';

in a script and use Pig to run that script (not the grunt shell), there will be only one job, because in batch mode Pig's multi-query execution merges the two STORE statements into a single job.
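As an illustration, here is a minimal sketch of the single-job variant described above; the script name myscript.pig and the output paths 'aa' and 'ab' are placeholders I am assuming, not taken from the original:

A = LOAD 'test.csv' USING PigStorage();
SPLIT A INTO AA IF $0 == 'A', AB IF $0 == 'B';
-- Two STORE statements in one script: in batch mode Pig's
-- multi-query optimizer can combine them into a single job.
STORE AA INTO 'aa' USING PigStorage();
STORE AB INTO 'ab' USING PigStorage();

Running this with "pig myscript.pig" (batch mode) triggers the multi-query optimization; typing the same statements at the grunt> prompt executes each STORE as a separate job.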
Upvotes: 0