Vikas Hardia

Reputation: 2695

How can I ignore " (double quotes) while loading a file in Pig?

I have the following data in a file:

"a","b","1","2"
"a","b","4","3"
"a","b","3","1"

I am reading this file using the command below:

File1 = LOAD '/path' using PigStorage (',') as (f1:chararray,f2:chararray,f3:int,f4:int)

But the data in fields 3 and 4 is being ignored.

How can I read this file correctly, or is there a way to make Pig skip the '"' characters?

Additional information: I am using Apache Pig version 0.10.0.

Upvotes: 4

Views: 4748

Answers (5)

BugBake

Reputation: 1

You can use the CSVExcelStorage loader from Piggybank, which handles the double quotes in the data. You have to register the Piggybank jar to use this loader.

Register ${jar_location}/piggybank-0.15.0.jar;

load_data = load '${data_location}' using 
org.apache.pig.piggybank.storage.CSVExcelStorage(',');
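
To see what a quote-aware CSV parser does with the sample rows from the question, here is a minimal Python sketch using the standard csv module. It is only an analogy for the behaviour described above, not how CSVExcelStorage is implemented, and the parse_rows helper name is hypothetical.

```python
import csv
import io

# The three sample rows from the question.
SAMPLE = '"a","b","1","2"\n"a","b","4","3"\n"a","b","3","1"\n'

def parse_rows(text):
    """Parse quoted CSV text and cast the last two fields to int."""
    # csv.reader strips the enclosing double quotes from each field.
    reader = csv.reader(io.StringIO(text))
    return [(f1, f2, int(f3), int(f4)) for f1, f2, f3, f4 in reader]

if __name__ == "__main__":
    for row in parse_rows(SAMPLE):
        print(row)
```

Once the quotes are gone, the int cast on the last two fields succeeds, which is the step that fails when the file is loaded with a plain comma split.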

Hope this helps.

Upvotes: 0

SKC

Reputation: 11

Try the following (no need to escape or replace the double quotes):

using org.apache.pig.piggybank.storage.CSVExcelStorage() 

Upvotes: 1

Wenzhong

Reputation: 56

How about using REPLACE, if the case is this simple?

data = LOAD 'YOUR_DATA' Using PigStorage(',') AS (a:chararray, b:chararray, c:chararray, d:chararray) ;

new_data = foreach data generate 
   REPLACE(a, '"', '') AS a,
   REPLACE(b, '"', '') AS b, 
   (int)REPLACE(c, '"', '') AS c:int, 
   (int)REPLACE(d, '"', '') AS d:int;

One more tip: if you are loading a CSV file, setting the correct number format in an Excel-like tool might also help.

Upvotes: 0

merours

Reputation: 4106

You may use the REPLACE function (it won't be in one pass, though):

file1 = load 'your.csv' using PigStorage(',');
data = foreach file1 generate (chararray)$0 as f1, (chararray)$1 as f2, (int)REPLACE($2, '\\"', '') as f3, (int)REPLACE($3, '\\"', '') as f4;

You may also use regexes with REGEX_EXTRACT:

file1 = load 'your.csv' using PigStorage(',');
data = foreach file1 generate $0, $1, REGEX_EXTRACT($2, '([0-9]+)', 1), REGEX_EXTRACT($3, '([0-9]+)', 1);

Of course, you could strip the " from f1 and f2 the same way.
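
To check what the two patterns do to a quoted token before running them in Pig, here is a rough Python sketch using the re module. The function names are hypothetical, and Python's regex engine only approximates the Java regexes that Pig uses; for patterns this simple the results should match.

```python
import re

def strip_quotes(token):
    # Rough equivalent of REPLACE(token, '"', ''): drop every double quote.
    return re.sub('"', '', token)

def first_digits(token):
    # Rough equivalent of REGEX_EXTRACT(token, '([0-9]+)', 1).
    # This Python version returns None when the token contains no digits.
    match = re.search(r'([0-9]+)', token)
    return match.group(1) if match else None

if __name__ == "__main__":
    for token in ['"1"', '"42"', '"a"']:
        print(token, strip_quotes(token), first_digits(token))
```

Both helpers return strings, so the result still has to be cast to an integer afterwards, just as in the Pig version.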

Upvotes: 3

o-90

Reputation: 17593

If you have Jython installed, you could deploy a simple UDF to accomplish the job.

Python UDF

#!/usr/bin/env python

'''
udf.py
'''

@outputSchema("out:chararray")
def formatter(item):
    # Pig passes null fields as None; hand them back unchanged.
    if item is None:
        return None
    # Strip the enclosing double quotes. The declared schema is chararray,
    # so always return a string; cast to int on the Pig side if needed.
    return item.strip('"')

Pig script

REGISTER 'udf.py' USING jython as udf;
data = load 'file' USING PigStorage(',') AS (col1:chararray, col2:chararray,
       col3:chararray, col4:chararray);
out = foreach data generate udf.formatter(col1) as a, udf.formatter(col3) as b;
dump out;

(a,1)
(a,4)
(a,3)
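
To sanity-check a quote-stripping function like the UDF above without starting Pig, here is a minimal sketch for plain Python. The outputSchema decorator is only available when Pig runs the script under Jython, so the stand-in below is an assumption made for local testing and should not be shipped in udf.py.

```python
def outputSchema(schema):
    """Local stand-in for the decorator Pig provides under Jython (assumption)."""
    def wrap(func):
        func.outputSchema = schema  # just record the schema string
        return func
    return wrap

@outputSchema("out:chararray")
def formatter(item):
    # Null fields arrive as None; pass them through unchanged.
    if item is None:
        return None
    # Remove the enclosing double quotes and keep the value as a string.
    return item.strip('"')

if __name__ == "__main__":
    for token in ['"a"', '"1"', None]:
        print(repr(formatter(token)))
```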

Upvotes: 0
