Abhijit Bashetti

Reputation: 8658

Apache Pig group by function is not giving the expected output

I have data in CSV format with the following columns:

"first_name","last_name","company_name","address","city","county","postal","phone1","phone2","email","web"

The sample file is named User.csv and contains the following data:

"Aleshia","Tomkiewicz","Alan D Rosenburg Cpa Pc","14, Taylor St","St. Stephens Ward","Kent","CT2 7PP","01835-703597","01944-369967","[email protected]","http://www.alandrosenburgcpapc.co.uk"
"Evan","Zigomalas","Cap Gemini America","5, Binney St","Abbey Ward","Buckinghamshire","HP11 2AX","01937-864715","01714-737668","[email protected]","http://www.capgeminiamerica.co.uk"
"France","Andrade","Elliott, John W Esq","8 Moor Place","East Southbourne and Tuckton W","Bournemouth","BH6 3BE","01347-368222","01935-821636","[email protected]","http://www.elliottjohnwesq.co.uk"

I load this file using PigStorage:

user = LOAD '/home/abhijit/Downloads/User.csv' USING PigStorage(',');

DUMP user;

The output looks like this:

("Aleshia","Tomkiewicz","Alan D Rosenburg Cpa Pc","14 Taylor St","St. Stephens Ward","Kent","CT2 7PP","01835-703597","01944-369967","[email protected]","http://www.alandrosenburgcpapc.co.uk")
("Evan","Zigomalas","Cap Gemini America","5, Binney St","Abbey Ward","Buckinghamshire","HP11 2AX","01937-864715","01714-737668","[email protected]","http://www.capgeminiamerica.co.uk")
("France","Andrade","Elliott, John W Esq","8 Moor Place","East Southbourne and Tuckton W","Bournemouth","BH6 3BE","01347-368222","01935-821636","[email protected]","http://www.elliottjohnwesq.co.uk")

I want to group by city, so I wrote:

grp = group user by $4; 
dump grp;

I get this output:

( Binney St",{("Evan","Zigomalas","Cap Gemini America","5, Binney St","Abbey Ward","Buckinghamshire","HP11 2AX","01937-864715","01714-737668","[email protected]","http://www.capgeminiamerica.co.uk")})
("8 Moor Place",{("France","Andrade","Elliott, John W Esq","8 Moor Place","East Southbourne and Tuckton W","Bournemouth","BH6 3BE","01347-368222","01935-821636","[email protected]","http://www.elliottjohnwesq.co.uk")})
("St. Stephens Ward",{("Aleshia","Tomkiewicz","Alan D Rosenburg Cpa Pc","14 Taylor St","St. Stephens Ward","Kent","CT2 7PP","01835-703597","01944-369967","[email protected]","http://www.alandrosenburgcpapc.co.uk")})

The company_name and address fields cause the problem because they contain ',' as part of the value, for example "14, Taylor St" in address or "Elliott, John W Esq" in company_name.

As a result, $4 resolves to "Taylor St" rather than "St. Stephens Ward".

Because of the extra delimiter in the address or company_name data, the fields are not separated properly on load, and the group by function does not give the correct result.

How can I get the group by output shown below?

("Abbey Ward",{("Evan","Zigomalas","Cap Gemini America","5, Binney St","Abbey Ward","Buckinghamshire","HP11 2AX","01937-864715","01714-737668","[email protected]","http://www.capgeminiamerica.co.uk")})
("St. Stephens Ward",{("Aleshia","Tomkiewicz","Alan D Rosenburg Cpa Pc","14, Taylor St","St. Stephens Ward","Kent","CT2 7PP","01835-703597","01944-369967","[email protected]","http://www.alandrosenburgcpapc.co.uk")})
("East Southbourne and Tuckton W",{("France","Andrade","Elliott, John W Esq","8 Moor Place","East Southbourne and Tuckton W","Bournemouth","BH6 3BE","01347-368222","01935-821636","[email protected]","http://www.elliottjohnwesq.co.uk")})


Grouping on a shifted column index, such as

grp = group user by $5;

is not a solution for me; I have already considered it.

Upvotes: 0

Views: 83

Answers (1)

LiMuBei

Reputation: 3078

The problem is that PigStorage does not take quoting or escaping into account: it splits on every comma, so a quoted field that contains a comma is broken into extra columns.

Using CSVExcelStorage (part of Piggybank) will solve this, as that storage function handles quoted fields and therefore produces the right number and order of columns.
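
A minimal sketch of what that could look like for the question's file. The piggybank.jar location is an assumption and depends on your installation, and the column names in the AS clause are simply taken from the header listed in the question, so that the grouping can use a name instead of a positional index:

```pig
-- Assumption: adjust this path to wherever piggybank.jar lives on your system.
REGISTER /path/to/piggybank.jar;

-- CSVExcelStorage treats commas inside double-quoted fields as data,
-- so "14, Taylor St" stays a single address column.
user = LOAD '/home/abhijit/Downloads/User.csv'
       USING org.apache.pig.piggybank.storage.CSVExcelStorage(',')
       AS (first_name:chararray, last_name:chararray, company_name:chararray,
           address:chararray, city:chararray, county:chararray, postal:chararray,
           phone1:chararray, phone2:chararray, email:chararray, web:chararray);

-- Group on the city column by name rather than by $4.
grp = GROUP user BY city;
DUMP grp;
```

Note that this loader generally drops the surrounding double quotes when reading, so the dumped keys and fields may appear without quotation marks.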

Upvotes: 1
