Reputation: 23
I have about 150 columns in total and want to find every \t in them and replace it with a space.
A = LOAD 'db.table' USING org.apache.hcatalog.pig.HCatLoader();
B = GROUP A ALL;
C = FOREACH B GENERATE REPLACE(B, '\\t', ' ');
STORE C INTO 'location';
This produces only the word ALL as output.
Is there a better way to do the replacement across all columns at once?
Thank you, Nivi
Upvotes: 2
Views: 928
Reputation: 17585
You could do this with a Python UDF. Say you had some data like this with tabs in it:
Data:
hi there friend,whats up,nothing much
yo yo yo,green eggs, ham
You could write this in Python
UDF:
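@outputSchema("datums:{(no_tabs:chararray)}")
def remove_tabs(columns):
    # columns is the bag of tuples produced by GROUP ... ALL
    try:
        out = [tuple(map(lambda s: s.replace("\t", " "), x)) for x in columns]
        return out
    except Exception:
        # (None,) is a one-element tuple; (None) would just be None
        return [(None,)]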
And then in Pig:
Query:
REGISTER 'remove_tabs.py' USING jython AS udf;
data = LOAD 'toy_data' USING PigStorage(',') AS (col0:chararray,
    col1:chararray, col2:chararray);
grpd = GROUP data all;
A = FOREACH grpd GENERATE FLATTEN(udf.remove_tabs(data));
DUMP A;
Output:
(hi there friend,whats up,nothing much)
(yo yo yo,green eggs,ham)
Obviously you have more than three columns, but since you are grouping by all and the UDF maps over every field of each tuple, the script should generalize to any number of columns.
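If you want to check the replacement logic before wiring it into Pig, here is a minimal sketch you can run with an ordinary Python 3 interpreter. It assumes the bag arrives as a list of tuples, and it skips non-string fields (for example nulls), which is an addition that the UDF above does not have. The sample rows are made up for illustration. Under Jython the string type check may need adjusting.

```python
def remove_tabs(columns):
    """Replace tabs with spaces in every string field of every tuple in the bag."""
    cleaned = []
    for row in columns:
        cleaned.append(tuple(
            # Leave nulls and other non-string values untouched (assumption:
            # this behaviour is not part of the original UDF).
            field.replace("\t", " ") if isinstance(field, str) else field
            for field in row
        ))
    return cleaned


# Hypothetical sample bag with embedded tabs and one null field.
bag = [
    ("hi\tthere\tfriend", "whats\tup", "nothing\tmuch"),
    ("yo\tyo\tyo", "green\teggs", None),
]
result = remove_tabs(bag)
print(result)
```

Because the function only loops over whatever fields each tuple contains, the same code handles 3 columns or 150 without any change.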
Upvotes: 1