Nirupreddy P

Reputation: 23

PIG replace for multiple columns

I have about 150 columns and want to find every \t in them and replace it with a space.

A = LOAD 'db.table' USING org.apache.hcatalog.pig.HCatLoader();
B = GROUP A ALL;
C = FOREACH B GENERATE REPLACE(B, '\\t', ' ');
STORE C INTO 'location';

This produces only the word ALL as output.

Is there a better way to run the replace on all columns at once?

Thank you, Nivi

Upvotes: 2

Views: 928

Answers (1)

o-90

Reputation: 17585

You could do this with a Python UDF. Say you had some data like this with tabs in it:

Data:

hi    there    friend,whats    up,nothing    much
yo    yo    yo,green    eggs,ham

You could write this in Python

UDF:

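# When registered USING jython, Pig supplies the outputSchema decorator.
@outputSchema("datums:{(no_tabs:chararray)}")
def remove_tabs(columns):
    # columns is the bag built by GROUP ... ALL: an iterable of tuples.
    try:
        # Replace every tab with a space in each field of each tuple.
        return [tuple(s.replace("\t", " ") for s in x) for x in columns]
    except Exception:
        # Fall back to a single one-field tuple holding a null.
        return [(None,)]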

and then in Pig

Query:

REGISTER 'remove_tabs.py' USING jython AS udf;
data = LOAD 'toy_data' USING PigStorage(',') AS (col0:chararray,
       col1:chararray, col2:chararray);
grpd = GROUP data all;
A = FOREACH grpd GENERATE FLATTEN(udf.remove_tabs(data));
DUMP A;

Output:

(hi there friend,whats up,nothing much)
(yo yo yo,green eggs,ham)

Obviously you have more than three columns, but since you are grouping by ALL, the UDF receives each row as a whole tuple, so the script should generalize to any number of columns.
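If you want to check the replacement logic outside Pig first, here is a minimal standalone sketch in plain Python. It assumes the bag reaches the UDF as a list of tuples of strings; the function name replace_tabs_in_rows and the sample rows are only illustrative, and it passes non-string fields (such as nulls) through unchanged rather than failing the whole bag.

```python
def replace_tabs_in_rows(rows):
    """Return a new list of tuples with each tab in each string field
    replaced by a single space. Non-string fields are left as they are."""
    cleaned = []
    for row in rows:
        cleaned.append(tuple(
            field.replace("\t", " ") if isinstance(field, str) else field
            for field in row
        ))
    return cleaned


# Sample rows mirroring the toy data, with real tab characters.
sample = [
    ("hi\tthere\tfriend", "whats\tup", "nothing\tmuch"),
    ("yo\tyo\tyo", "green\teggs", "ham"),
]

result = replace_tabs_in_rows(sample)
for row in result:
    print(row)

# The same function handles a wide row, e.g. 150 columns (hypothetical data).
wide_row = tuple("a\tb" for _ in range(150))
wide_result = replace_tabs_in_rows([wide_row])
```

Nothing in the function depends on how many fields a tuple has, which is why the same approach should carry over to a 150-column table.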

Upvotes: 1
