user887349

Reputation: 11

Escape special characters in Apache Pig data

I am using Apache Pig to process some data.
My data set has some strings that contain special characters, e.g. #, {, }, [, and ].

The Programming Pig book says that you can't escape those characters.

So how can I process my data without deleting the special characters?

I thought about replacing them but would like to avoid that.

Thanks

Upvotes: 1

Views: 4781

Answers (3)

DMulligan

Reputation: 9073

The easiest way would be:

raw = LOAD 'inputLocation' USING TextLoader() AS (unparsedString:chararray);

TextLoader just reads each line of input into a String regardless of what's inside that string. You could then use your own parsing logic.
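For instance, a minimal sketch of that follow-up parsing, continuing from the LOAD above and assuming a hypothetical tab-delimited layout (the field names here are made up):

-- split each raw line on a tab; #, {, }, [ and ] pass through untouched inside the fields
parsed = FOREACH raw GENERATE FLATTEN(STRSPLIT(unparsedString, '\t')) AS (id:chararray, payload:chararray);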

Upvotes: 1

Jakub Kotowski

Reputation: 7571

When writing your own loader function, instead of returning tuples that hold, for example, a map as a String (and thus relying later on Utf8StorageConverter to get the conversion to a map right):

Tuple tuple = TupleFactory.getInstance().newTuple(1);
tuple.set(0, new DataByteArray("[age#22, name#joel]"));

you can create and set a Java map directly:

HashMap<String, Object> map = new HashMap<String, Object>(2);
map.put("age", 22);
map.put("name", "joel");
tuple.set(0, map);

This is especially useful if you have to do the parsing during loading anyway.

Upvotes: 0

reo katoa

Reputation: 5801

Have you tried loading your data? There is no way to escape these characters when they are part of the values in a tuple, bag, or map, but there is no problem whatsoever in loading these characters when they are part of a string. Just specify that field as type chararray.
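For example, a load along these lines (just a sketch; the path, delimiter, and field names are assumptions) keeps the special characters intact:

-- both fields are plain chararrays, so '#', '{', '}', '[' and ']' come through verbatim
data = LOAD 'inputLocation' USING PigStorage('\t') AS (id:chararray, msg:chararray);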

The only issue you will have to watch out for here is if your strings ever contain the character that Pig is using as the field delimiter - for example, if you are USING PigStorage(',') and your strings contain commas. But as long as you are not telling Pig to parse your field as a map, #, [, and ] will be handled just fine.
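If the strings can contain commas, one workaround (a sketch, assuming you control how the files are written) is to store and load them with a separator that never occurs in the values:

-- instead of PigStorage(','), pick a delimiter the strings never contain, e.g. a pipe
data = LOAD 'inputLocation' USING PigStorage('|') AS (id:chararray, msg:chararray);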

Upvotes: 1
