Reputation: 2695
I have the following data in a file:
"a","b","1","2"
"a","b","4","3"
"a","b","3","1"
I am reading this file with the command below:
File1 = LOAD '/path' using PigStorage (',') as (f1:chararray,f2:chararray,f3:int,f4:int)
But the data in fields 3 and 4 is being ignored.
How can I read this file correctly, or is there any way to make Pig skip the '"' characters?
Additional information: I am using Apache Pig version 0.10.0.
Upvotes: 4
Views: 4748
Reputation: 1
You can use the CSVExcelStorage loader from Pig's Piggybank. This loader handles the double quotes in the data. You have to register the Piggybank jar to use it.
Register ${jar_location}/piggybank-0.15.0.jar;
load_data = load '${data_location}' using
org.apache.pig.piggybank.storage.CSVExcelStorage(',');
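To illustrate why a quote-aware loader helps, here is a rough sketch in plain Python. It uses Python's own csv module as a stand-in, not CSVExcelStorage itself, so treat it only as an analogy: a plain comma split leaves the quote characters inside each field, while a CSV-aware reader removes them so the last two columns can be converted to integers.

```python
import csv
import io

# Sample rows copied from the question
raw = '"a","b","1","2"\n"a","b","4","3"\n"a","b","3","1"\n'

# A plain comma split keeps the quote characters inside every field,
# so a later integer conversion of '"1"' cannot succeed
naive = [line.split(',') for line in raw.splitlines()]

# A quote-aware reader removes the enclosing quotes while parsing
parsed = list(csv.reader(io.StringIO(raw)))

# With the quotes gone, the last two columns convert cleanly
typed = [(r[0], r[1], int(r[2]), int(r[3])) for r in parsed]

print(naive[0])
print(typed)
```

In Pig, a cast that fails yields null rather than an error, which would explain why f3 and f4 appear to be ignored when the file is loaded with PigStorage.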
Hope this helps.
Upvotes: 0
Reputation: 11
Try the following (no need to escape or replace the double quotes):
using org.apache.pig.piggybank.storage.CSVExcelStorage()
Upvotes: 1
Reputation: 56
How about using REPLACE, if the case is this simple?
data = LOAD 'YOUR_DATA' Using PigStorage(',') AS (a:chararray, b:chararray, c:chararray, d:chararray) ;
new_data = foreach data generate
REPLACE(a, '"', '') AS a,
REPLACE(b, '"', '') AS b,
(int)REPLACE(c, '"', '') AS c:int,
(int)REPLACE(d, '"', '') AS d:int;
One more tip: if you are loading a CSV file, setting the correct number format in an Excel-like tool might also help.
Upvotes: 0
Reputation: 4106
You may use the REPLACE function (it won't be in one pass, though):
file1 = load 'your.csv' using PigStorage(',');
data = foreach file1 generate $0 as (f1:chararray), $1 as (f2:chararray), REPLACE($2, '\\"', '') as (f3:int), REPLACE($3, '\\"', '') as (f4:int);
You may also use regexes with REGEX_EXTRACT:
file1 = load 'your.csv' using PigStorage(',');
data = foreach file1 generate $0, $1, REGEX_EXTRACT($2, '([0-9]+)', 1), REGEX_EXTRACT($3, '([0-9]+)', 1);
Of course, you could erase the " from f1 and f2 the same way.
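For anyone who wants to check what the pattern `([0-9]+)` picks out of a quoted field, here is a small sketch using Python's re module. It assumes the pattern is searched for anywhere in the string (rather than having to match the whole field), and `extract_int` is just an illustrative helper name, not something Pig provides.

```python
import re

# Same pattern as in the REGEX_EXTRACT call: one capture group of digits
DIGITS = re.compile(r'([0-9]+)')

def extract_int(field):
    """Return the first run of digits in field as an int, or None if there is none."""
    match = DIGITS.search(field)
    return int(match.group(1)) if match else None

for sample in ['"1"', '"42"', '"a"']:
    print(sample, '->', extract_int(sample))
```

A field without digits gives None here, which is roughly the counterpart of getting a null back in Pig.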
Upvotes: 3
Reputation: 17593
If you have Jython installed, you could deploy a simple UDF to accomplish the job.
Python UDF
#!/usr/bin/env python
'''
udf.py
'''
@outputSchema("out:chararray")
def formatter(item):
    chars = 'abcdefghijklmnopqrstuvwxyz'
    nums = '1234567890'
    # Take the text between the first pair of double quotes
    new_item = item.split('"')[1]
    # Default so the return below never hits an unassigned name
    output = None
    if new_item in chars:
        output = str(new_item)
    elif new_item in nums:
        output = int(new_item)
    return output
Pig script
REGISTER 'udf.py' USING jython as udf;
data = load 'file' USING PigStorage(',') AS (col1:chararray, col2:chararray,
col3:chararray, col4:chararray);
out = foreach data generate udf.formatter(col1) as a, udf.formatter(col3) as b;
dump out;
(a,1)
(a,4)
(a,3)
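If you want to try the quote-stripping logic outside Pig first, something like the sketch below runs under plain Python. The `outputSchema` function here is only a hypothetical stand-in for the decorator that Pig makes available to Jython UDFs, and `strip_quotes` is a simplified illustration rather than the exact function above.

```python
def outputSchema(schema):
    # Hypothetical stand-in: outside Pig this decorator does not exist,
    # so this version just records the schema string on the function
    def wrap(func):
        func.outputSchema = schema
        return func
    return wrap

@outputSchema("out:chararray")
def strip_quotes(item):
    # Keep the text between the first pair of double quotes
    return item.split('"')[1]

# Rows as PigStorage(',') would hand them over, quotes still attached
rows = [
    ['"a"', '"b"', '"1"', '"2"'],
    ['"a"', '"b"', '"4"', '"3"'],
    ['"a"', '"b"', '"3"', '"1"'],
]

# Mirror the foreach: first column and third column only
out = [(strip_quotes(r[0]), strip_quotes(r[2])) for r in rows]
print(out)
```

The values come back as strings, matching the chararray schema declared on the function.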
Upvotes: 0