Reputation: 772

How to perform a DISTINCT in Pig Latin on a subset of columns?

I would like to perform a DISTINCT operation on a subset of the columns. The documentation says this is possible with a nested foreach:

You cannot use DISTINCT on a subset of fields; to do this, use FOREACH and a nested block to first select the fields and then apply DISTINCT (see Example: Nested Block).

It is simple to perform a DISTINCT operation on all of the columns:

A = LOAD 'data' AS (a1,a2,a3,a4);
A_unique = DISTINCT A;

Lets say that I am interested in performing the distinct across a1, a2, and a3. Can anyone provide an example showing how to perform this operation with a nested foreach as suggested in the documentation?

Here's an example of input and expected output:

A = LOAD 'data' AS(a1,a2,a3,a4);
DUMP A;

(1 2 3 4)
(1 2 3 4)
(1 2 3 5)
(1 2 4 4)

-- insert DISTINCT operation on a1,a2,a3 here:
-- ...

DUMP A_unique;

(1 2 3 4)
(1 2 4 4)

Upvotes: 13

Answers (6)

Sebastián Vásquez

Reputation: 135

I was looking to do the same: "I would like to perform a DISTINCT operation on a subset of the columns". The way I did it was:

A = LOAD 'data' AS(a1,a2,a3,a4);
interested_fields = FOREACH A GENERATE a1,a2,a3;
distinct_fields= DISTINCT interested_fields;
final_answer = FOREACH distinct_fields GENERATE FLATTEN($0);

I know it's not an example of how to perform a nested foreach as suggested in the documentation; but it's a way of doing a distinct over a subset of fields. Hope It helps to anyone who gets here just like I did.

Upvotes: 0

costa xu

Reputation: 11

unique_A = FOREACH (GROUP A BY (a1, a2, a3)) {
    limit_a = LIMIT A 1;
    GENERATE FLATTEN(limit_a) AS (a1,a2,a3,a4);
};

Upvotes: 0

Prasoon Joshi

Reputation: 799

The accepted answer is one great solution but, in case you want to reorder the fields in the output (something I had to do recently) this might not work. Here's an alternative:

A = LOAD '$input' AS (f1, f2, f3, f4, f5);
GP = GROUP A BY (f1, f2, f3);
OUTPUT = FOREACH GP GENERATE 
    group.f1, group.f2, f4, f5, group.f3 ;

When you group on certain fields, the selection would have unique values for the group in a each tuple.

Upvotes: 2

one2go

Reputation: 11

For your specified input/output, the following works. You might update your test vectors to clarify what you need that is different than this.

A_unique = DISTINCT A;

Upvotes: 1

reo katoa

Reputation: 5801

Group on all the other columns, project just the columns of interest into a bag, and then use FLATTEN to expand them out again:

A_unique =
    FOREACH (GROUP A BY a4) {
        b = A.(a1,a2,a3);
        s = DISTINCT b;
        GENERATE FLATTEN(s), group AS a4;
    };

Upvotes: 21

Freerobots

Reputation: 772

Here are 2 possible solutions, are there any other good approaches?

Solution 1 (using LIMIT 1):

A = LOAD 'test_data' AS (a1,a2,a3,a4);

-- Combine the columns that I want to perform the distinct across into a tuple
A2 = FOREACH A GENERATE TOTUPLE(a1,a2,a3) AS combined, a4 as a4

-- Group by the combined column
grouped_by_a4 = GROUP A2 BY combined;

grouped_and_distinct = FOREACH grouped_by_a4 {
        single = LIMIT A2 1;
        GENERATE FLATTEN(single);
};

Solution 2 (using DISTINCT):

A = LOAD 'test_data' AS (a1,a2,a3,a4);

-- Combine the columns that I want to perform the distinct across into a tuple
A2 = FOREACH A GENERATE TOTUPLE(a1,a2,a3) AS combined, a4 as a4

-- Group by the other columns (those I don't want the distinct applied to)
grouped_by_a4 = GROUP A2 BY a4;

-- Perform the distinct on a projection of combined and flatten 
grouped_and_distinct = FOREACH grouped_by_a4 {
        combined_unique = DISTINCT A2.combined;
        GENERATE FLATTEN(combined_unique);
};

Upvotes: 0

How to perform a DISTINCT in Pig Latin on a subset of columns?

Answers (6)

Related Questions