Reputation: 438
I have a problem with a join in pig. I'll begin by giving you the context. Here is my code:
-- START file loading
start_file = LOAD 'dir/start_file.csv' USING PigStorage(';') as (PARTRANGE:chararray, COD_IPUSER:chararray);
-- trim
A = FOREACH start_file GENERATE TRIM(PARTRANGE) AS PARTRANGE, TRIM(COD_IPUSER) AS COD_IPUSER;
dump A;
which gives the output:
(79.92.147.88,20140310)
(79.92.147.88,20140310)
(109.31.67.3,20140310)
(109.31.67.3,20140310)
(109.7.229.143,20140310)
(109.8.114.133,20140310)
(77.198.79.99,20140310)
(77.200.174.171,20140310)
(77.200.174.171,20140310)
(109.17.117.212,20140310)
loading the other file:
-- Chargement du fichier recherche Hadopi
file2 = LOAD 'dir/file2.csv' USING PigStorage(';') as (IP_RECHERCHEE:chararray, DATE_HADO:chararray);
dump file2;
the output is this:
(2014/03/10 00:00:00,79.92.147.88)
(2014/03/10 00:00:01,79.92.147.88)
(2014/03/10 00:00:00,192.168.2.67)
Now, I want to do a left outer join. Here's the code:
result = JOIN file2 by IP_RECHERCHEE LEFT OUTER, A by COD_IPUSER;
dump result;
The output is this:
(2014/03/10 00:00:00,79.92.147.88,,)
(2014/03/10 00:00:00,192.168.2.67,,)
(2014/03/10 00:00:01,79.92.147.88,,)
All the records of "file2" are here, which is fine but any of start_file are here. It is like if the join has failed.
Do you know where the problem is ?
Thanks.
Upvotes: 0
Views: 630
Reputation: 76
Your fields' name are wrong, and you join by the wrong field. It seems that you want to join by the IP address.
start_file = LOAD 'dir/start_file.csv' USING PigStorage(';') as (IP:chararray, PARTRANGE:chararray);
A = FOREACH start_file GENERATE TRIM(IP) AS IP, TRIM(PARTRANGE) AS PARTRANGE;
file2 = LOAD 'dir/file2.csv' USING PigStorage(';') as (DATE_HADO:chararray, IP:chararray);
what I got is this
(2014/03/10 00:00:00,192.168.2.67,,)
(2014/03/10 00:00:00,79.92.147.88,79.92.147.88,20140310)
(2014/03/10 00:00:00,79.92.147.88,79.92.147.88,20140310)
(2014/03/10 00:00:01,79.92.147.88,79.92.147.88,20140310)
(2014/03/10 00:00:01,79.92.147.88,79.92.147.88,20140310)
Upvotes: 1
Reputation: 1574
The result is as expected. You are calling Left outer join , which looks for matching of IP_RECHERCHEE field in file2 with COD_IPUSER of A.
Since there is no matching, it returns all the IP_RECHERCHEE fields in file2 and puts null in place of fields from A.
Clearly 2014/03/10 00:00:00 != 20140310
Upvotes: 1
Reputation: 5801
You have mislabeled your fields in file2
. You are calling the first field the IP and the second field the date, when, as shown by your dump
, the opposite is the case. Try FOREACH file2 GENERATE IP_RECHERCHEE
and you will see the fields you are trying to join on.
Upvotes: 3