shanks_roux
shanks_roux

Reputation: 438

Pig - Join doesn't work

I have a problem with a join in pig. I'll begin by giving you the context. Here is my code:

-- START file loading
start_file = LOAD 'dir/start_file.csv' USING PigStorage(';') as (PARTRANGE:chararray,     COD_IPUSER:chararray);

-- trim
A = FOREACH start_file GENERATE TRIM(PARTRANGE) AS PARTRANGE, TRIM(COD_IPUSER) AS COD_IPUSER;

dump A;

which gives the output:

(79.92.147.88,20140310)
(79.92.147.88,20140310)
(109.31.67.3,20140310)
(109.31.67.3,20140310)
(109.7.229.143,20140310)
(109.8.114.133,20140310)
(77.198.79.99,20140310)
(77.200.174.171,20140310)
(77.200.174.171,20140310)
(109.17.117.212,20140310)

loading the other file:

-- Chargement du fichier recherche Hadopi
file2 = LOAD 'dir/file2.csv' USING PigStorage(';') as (IP_RECHERCHEE:chararray, DATE_HADO:chararray);

dump file2;

the output is this:

(2014/03/10 00:00:00,79.92.147.88)
(2014/03/10 00:00:01,79.92.147.88)
(2014/03/10 00:00:00,192.168.2.67)

Now, I want to do a left outer join. Here's the code:

result = JOIN file2 by IP_RECHERCHEE LEFT OUTER, A by COD_IPUSER;
dump result;

The output is this:

(2014/03/10 00:00:00,79.92.147.88,,)
(2014/03/10 00:00:00,192.168.2.67,,)
(2014/03/10 00:00:01,79.92.147.88,,)

All the records of "file2" are here, which is fine but any of start_file are here. It is like if the join has failed.

Do you know where the problem is ?

Thanks.

Upvotes: 0

Views: 630

Answers (3)

IronManZ
IronManZ

Reputation: 76

Your fields' name are wrong, and you join by the wrong field. It seems that you want to join by the IP address.

start_file = LOAD 'dir/start_file.csv' USING PigStorage(';') as (IP:chararray, PARTRANGE:chararray);

A = FOREACH start_file GENERATE TRIM(IP) AS IP, TRIM(PARTRANGE) AS PARTRANGE;

file2 = LOAD 'dir/file2.csv' USING PigStorage(';') as (DATE_HADO:chararray, IP:chararray);

what I got is this

(2014/03/10 00:00:00,192.168.2.67,,)
(2014/03/10 00:00:00,79.92.147.88,79.92.147.88,20140310)
(2014/03/10 00:00:00,79.92.147.88,79.92.147.88,20140310)
(2014/03/10 00:00:01,79.92.147.88,79.92.147.88,20140310)
(2014/03/10 00:00:01,79.92.147.88,79.92.147.88,20140310)

Upvotes: 1

Chandra kant
Chandra kant

Reputation: 1574

The result is as expected. You are calling Left outer join , which looks for matching of IP_RECHERCHEE field in file2 with COD_IPUSER of A.
Since there is no matching, it returns all the IP_RECHERCHEE fields in file2 and puts null in place of fields from A.
Clearly 2014/03/10 00:00:00 != 20140310

Upvotes: 1

reo katoa
reo katoa

Reputation: 5801

You have mislabeled your fields in file2. You are calling the first field the IP and the second field the date, when, as shown by your dump, the opposite is the case. Try FOREACH file2 GENERATE IP_RECHERCHEE and you will see the fields you are trying to join on.

Upvotes: 3

Related Questions