user3888449

Reputation: 11

Pig LOAD with SPLIT and COGROUP and number of mappers

I have noticed that the number of mappers in a Pig job doubles when I add a SPLIT and a COGROUP statement to the script after the LOAD. Is this expected, and does anyone know why it happens?

I load a dataset using PigStorage:

A = LOAD 'test.csv' USING PigStorage;

cat test.csv
A   123
A   345
B   234
B   123

I then split the dataset into two relations using SPLIT (the result is the same with FILTER), cogroup the two relations into one, and store the result.

SPLIT A INTO AA IF $0 == 'A', AB IF $0 == 'B';
CG = COGROUP AA BY $1, AB BY $1;

STORE CG INTO 'cg' USING PigStorage();

When I do that, I can see from my (local) output the following lines:

Success!
Job Stats (time in seconds):
JobId   Alias   Feature Outputs
job_local_0001  A,AA,AB,CG  COGROUP /test/cg,

Input(s):
Successfully read records from: "/test/test.csv"
Successfully read records from: "/test/test.csv"

Output(s):
Successfully stored records in: "/test/cg"

So it looks like the data is read twice. Indeed, when I run this on a cluster, the number of mappers doubles.

What causes this behaviour? Is there a way to avoid it, or does it have a good reason I am missing?

Upvotes: 1

Views: 286

Answers (2)

bhaskarrana

Reputation: 9

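This avoids reading the file twice: load once, group by the key, and filter inside a nested FOREACH.

-- The delimiter must match the input file; the question loads with the default (tab).
A = LOAD 'test.csv' USING PigStorage(',');
-- Group on the same key the COGROUP used.
B = GROUP A BY $1;
C = FOREACH B {
    -- Rebuild the two bags from the single grouped bag.
    AA = FILTER A BY $0 == 'A';
    BB = FILTER A BY $0 == 'B';
    GENERATE FLATTEN($0), AA, BB;
};
DUMP C;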
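
To check whether this rewrite really reads the input only once, one option is to swap the final dump for EXPLAIN, which prints the logical, physical and MapReduce plans without running the job. The following is only a sketch of that check; I have not run it against this data, so treat the idea of counting load operators in the map plan as an assumption about what to look for, not a guaranteed output format.

```pig
-- Same script as above, but the plans are printed instead of executed.
A = LOAD 'test.csv' USING PigStorage(',');
B = GROUP A BY $1;
C = FOREACH B {
    AA = FILTER A BY $0 == 'A';
    BB = FILTER A BY $0 == 'B';
    GENERATE FLATTEN($0), AA, BB;
};
-- In the MapReduce plan, look at how many load operators feed the map phase.
EXPLAIN C;
```

Running EXPLAIN on CG from the original SPLIT/COGROUP script gives a plan to compare against.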

Upvotes: 0

zsxwing

Reputation: 20826

It depends on how you use AA and AB, and on how you run the script.

  • If you write dump AA; dump AB; either in a script or in the Grunt shell, there will be two jobs.
  • If you write store AA into '...'; store AB into '...'; in the Grunt shell, there will also be two jobs.

However, if you put store AA into '...'; store AB into '...'; in a script file and run that file with Pig (not from the Grunt shell), there will be only one job.
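
As a rough sketch of the single-job case, the two STORE statements can go into one script file that is run in batch mode. The file name two_stores.pig and the output paths out_a and out_b are made up for illustration. The -no_multiquery option (short form -M) turns off Pig's multi-query execution, so running the script with and without it is one way to compare the job counts.

```pig
-- two_stores.pig (hypothetical file name)
-- Run in batch mode:            pig -x local two_stores.pig
-- Compare with multi-query off: pig -x local -no_multiquery two_stores.pig

A = LOAD 'test.csv' USING PigStorage();
SPLIT A INTO AA IF $0 == 'A', AB IF $0 == 'B';

-- Both stores live in the same script, so Pig can plan them together.
STORE AA INTO 'out_a' USING PigStorage();  -- hypothetical output path
STORE AB INTO 'out_b' USING PigStorage();  -- hypothetical output path
```

The Job Stats block printed at the end of each run lists the jobs and the inputs that were read, which is the same output the question quotes.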

Upvotes: 0
