Reputation: 33
I'm a beginner at writing Python and Pig UDFs and I'm struggling with the charset conversion of an input file using Pig.
I've browsed Stack Overflow and the whole internet for days and tried a bunch of different things, but I'm still stuck.
Hopefully someone can give me a gentle push in the right direction.
Environment: a real distributed Hadoop cluster (no local instance) / Cloudera configured with UTF-8,
and Apache Pig version 0.12.0
My source file is encoded in iso-8859-1
and the goal is to store its content as UTF-8
(after or before several other operations within Pig).
The input file looks like this (a few iso-8859-1 characters for testing purposes, one character per line):
ù
û
ü
ÿ
à
â
æ
ç
é
è
ê
ë
î
ô
This is my Pig script:
RMF $output;
REGISTER 'charsetConversion.py' using org.apache.pig.scripting.jython.JythonScriptEngine AS pyudf;
data = LOAD '$input' USING PigStorage() AS (col1:chararray); --col1:bytearray
final = foreach data generate $0, pyudf.toUTF8(col1);
STORE final INTO '$output' USING PigStorage();
And my UDF (written in Python):
#!/usr/bin/env python
# charsetConversion.py
@outputSchema("word:chararray")
def toUTF8(s):
    return unicode(s, 'iso-8859-1').encode('utf-8')
After running/submitting the script, I got the following output:
� �
� �
� �
� �
� �
� �
� �
� �
� �
� �
� �
� �
� �
� �
� �
� �
� �
� �
� �
� �
� �
In the second column I would expect the same readable values as in the input file, not �.
What happens here?
Is my approach recommended at all?
What other approaches exist (apart from the Java solution in Stack Overflow: Encoding in Pig)?
Thanks a lot in advance for any advice.
Upvotes: 3
Views: 1969
Reputation: 115
Here's a UDF defined in Python that worked for me, for those of you who aren't comfortable with Scala:
#!/usr/bin/env python
# charsetConversion.py
import struct
@outputSchema("word:chararray")
def toUTF8(s):
    # rebuild the raw bytes and decode them as iso-8859-1
    line = ''.join([struct.pack('B', x).decode('iso-8859-1') for x in s])
    return line
Here are the Pig commands to register the UDF, load the data, apply the UDF to the data, and take a sample of the result to check whether the decoding worked as expected.
REGISTER 'charsetConversion.py' USING org.apache.pig.scripting.jython.JythonScriptEngine AS pyudf;
data = LOAD '$input' USING TextLoader AS (col1: bytearray);
final = FOREACH data GENERATE $0, pyudf.toUTF8(col1);
final_lim = LIMIT final 10;
DUMP final_lim;
As xhudik mentioned in his answer, the important part here is to define your field as a byte array.
Upvotes: 1
Reputation: 2444
I was solving a similar problem. Yes, character encoding can be tricky in the Java/Hadoop world :).
Actually, you were pretty close - the solution is:
data = LOAD '$input' USING PigStorage() AS (col1:bytearray);
The problem was that you specified col1 as a chararray. A chararray is "an array (string) in Unicode UTF-8 format". However, this is not true for your input data, which is in iso-8859-1. The Pig script interpreted your data as UTF-8 instead of iso-8859-1. You should specify bytearray instead. A bytearray does not interpret the data in any way; you have to do that yourself (e.g. in your UDF function).
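A quick way to see why the chararray route produces � is this purely illustrative plain Python 2 snippet (run outside Pig, not part of any UDF): an iso-8859-1 byte such as 0xF9 is not valid UTF-8, so reading it as UTF-8 yields the replacement character, while reading it as iso-8859-1 gives the expected letter.
# -*- coding: utf-8 -*-
# illustrative only: mimic how Pig treats a chararray field as UTF-8
raw = u'ù'.encode('iso-8859-1')             # the single byte '\xf9' as stored in the file
print repr(raw.decode('utf-8', 'replace'))  # u'\ufffd' -> rendered as �
print repr(raw.decode('iso-8859-1'))        # u'\xf9'   -> the expected 'ù'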
Next, we need to process each line:
parsed = foreach data generate flatten(my.testparser($0));
Then, within the UDF function (my.testparser()), we change the iso-8859-1 encoding into UTF-8:
....
val line: String = new String(input.get(0).asInstanceOf[DataByteArray].get(), "windows-1250")
....
The code is in Scala (I'm not a Python guy, sorry). It takes the input DataByteArray (bytearray in Pig) and gets an array of Scala Bytes. Those bytes are then interpreted as windows-1250.
That said, your UDF can stay the same, but the input needs to be changed to bytearray in Pig and to the equivalent data type in your UDF.
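For Python users, a hypothetical equivalent of that adaptation could look like the sketch below (untested, and it assumes, as in the Python answer above, that the bytearray field reaches the Jython UDF as a sequence of byte values):
#!/usr/bin/env python
# charsetConversionBytes.py -- hypothetical sketch, not the original poster's file
import struct
@outputSchema("word:chararray")
def toUTF8(s):
    # s is assumed to be a sequence of byte values; mask to 0..255 in case they
    # arrive as signed Java bytes, rebuild the raw byte string and interpret it
    # as iso-8859-1 (Pig writes chararrays out as UTF-8 on STORE)
    raw = ''.join([struct.pack('B', b & 0xFF) for b in s])
    return unicode(raw, 'iso-8859-1')
The LOAD statement with col1 declared as a bytearray stays exactly as shown above.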
Hope this helps
Upvotes: 1