Reputation: 11
I am using Apache Pig to process some data.
My data set has some strings that contain special characters, such as #, {, }, [ and ].
The Programming Pig book says that you can't escape those characters.
So how can I process my data without deleting the special characters?
I thought about replacing them but would like to avoid that.
Thanks
Upvotes: 1
Views: 4781
Reputation: 9073
The easiest way would be:
raw_lines = LOAD 'inputLocation' USING TextLoader() AS (unparsedString:chararray);
TextLoader just reads each line of input into a String regardless of what's inside that string. You could then use your own parsing logic.
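For instance, here is a minimal follow-up sketch, assuming (hypothetically) that each line holds an id and a payload separated by a tab:
parsed = FOREACH raw_lines GENERATE FLATTEN(STRSPLIT(unparsedString, '\t', 2)) AS (id, payload);
-- STRSPLIT only consumes the tab; #, {, }, [ and ] pass through untouched
The key point is that nothing downstream ever asks Pig to interpret the special characters as map or bag syntax.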
Upvotes: 1
Reputation: 7571
When writing your own loader function, instead of returning tuples with, e.g., maps serialized as strings (and thus relying later on Utf8StorageConverter to get the conversion to a map right):
Tuple tuple = tupleFactory.newTuple(1);
tuple.set(0, new DataByteArray("[age#22, name#joel]"));
you can create and set a Java map directly:
HashMap<String, Object> map = new HashMap<String, Object>(2);
map.put("age", 22);
map.put("name", "joel");
tuple.set(0, map);
This is useful especially if you have to do the parsing during loading anyway.
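For reference, a hypothetical Pig-side usage sketch, assuming the loader is packaged in myloader.jar under the made-up name com.example.MapLoader and emits one map per record:
REGISTER myloader.jar;
people = LOAD 'inputLocation' USING com.example.MapLoader() AS (info:map[]);
-- since the loader already emits a real Java map, the # lookup operator works directly
ages = FOREACH people GENERATE info#'age' AS age;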
Upvotes: 0
Reputation: 5801
Have you tried loading your data? There is no way to escape these characters when they are part of the values in a tuple, bag, or map, but there is no problem whatsoever in loading these characters when they are part of a string. Just specify that field as type chararray.
The only issue you will have to watch out for is if your strings ever contain the character that Pig is using as the field delimiter - for example, if you are USING PigStorage(',') and your strings contain commas. But as long as you are not telling Pig to parse your field as a map, #, [, and ] will be handled just fine.
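A minimal sketch, assuming a hypothetical tab-delimited file data.txt whose second field contains characters like #, {, } and [:
records = LOAD 'data.txt' USING PigStorage('\t') AS (id:int, payload:chararray);
-- payload arrives verbatim, special characters included, because Pig never tries to parse a chararray
DUMP records;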
Upvotes: 1