Reputation: 33
I'm a beginner at writing Python and Pig UDFs and I'm struggling with the charset conversion of an input file using Pig.
I've browsed Stack Overflow and the whole internet for days and tried a bunch of different things, but I'm still stuck.
Hopefully someone can give me a gentle push in the right direction.
Environment: a real distributed Hadoop cluster (no local instance) / Cloudera configured with UTF-8,
and Apache Pig version 0.12.0
My source file is encoded in iso-8859-1
and the goal is to store its content as UTF-8
(after or before several other operations within Pig).
The input file looks like this (a few iso-8859-1 characters for testing purposes, one character per line):
ù
û
ü
ÿ
à
â
æ
ç
é
è
ê
ë
î
ô
This is my Pig script:
RMF $output;
REGISTER 'charsetConversion.py' using org.apache.pig.scripting.jython.JythonScriptEngine AS pyudf;
data = LOAD '$input' USING PigStorage() AS (col1:chararray); --col1:bytearray
final = foreach data generate $0, pyudf.toUTF8(col1);
STORE final INTO '$output' USING PigStorage();
And my UDF (written in Python):
#!/usr/bin/env python
# charsetConversion.py
@outputSchema("word:chararray")
def toUTF8(s):
    return unicode(s, 'iso-8859-1').encode('utf-8')
After running/submitting the script, I got the following output:
� �
� �
� �
� �
� �
� �
� �
� �
� �
� �
� �
� �
� �
� �
� �
� �
� �
� �
� �
� �
� �
In the second column I would expect the same readable values as in the input file, not �.
What happens here?
Is my approach recommended at all?
What other approaches exist (apart from the Java solution in Stack Overflow: Encoding in Pig)?
Thanks a lot in advance for any advice.
Upvotes: 3
Views: 1969
Reputation: 115
Here's a UDF defined in Python that worked for me, for those of you who aren't comfortable with Scala:
#!/usr/bin/env python
# charsetConversion.py
import struct
@outputSchema("word:chararray")
def toUTF8(s):
    # rebuild the raw bytes and decode them as iso-8859-1
    line = ''.join([struct.pack('B', x).decode('iso-8859-1') for x in s])
    return line
Here are the Pig commands to register the UDF, load the data, apply the UDF to the data, and take a sample of the result to check whether the decoding worked as expected.
REGISTER 'charsetConversion.py' USING org.apache.pig.scripting.jython.JythonScriptEngine AS pyudf;
data = LOAD '$input' USING TextLoader AS (col1: bytearray);
final = FOREACH data GENERATE $0, pyudf.toUTF8(col1);
final_lim = LIMIT final 10;
DUMP final_lim;
As xhudik mentioned in his answer, the important part here is to define your field as a byte array.
Upvotes: 1
Reputation: 2444
I was solving a similar problem. Yes, character encoding can be tricky in the Java/Hadoop world :).
Actually, you were pretty close - the solution is:
data = LOAD '$input' USING PigStorage() AS (col1:bytearray);
The problem was that you specified col1 as a chararray. A chararray is "an array (string) in Unicode UTF-8 format". However, this is not true for your input data, which is in iso-8859-1. The Pig script interpreted your data as UTF-8 instead of iso-8859-1. You should specify bytearray instead. A bytearray does not interpret the data in any way; you have to do that yourself (e.g. in your UDF function).
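A quick way to see why the chararray route produces � is this purely illustrative plain Python 2 snippet (run outside Pig, not part of any UDF): an iso-8859-1 byte such as 0xF9 is not valid UTF-8, so reading it as UTF-8 yields the replacement character, while reading it as iso-8859-1 gives the expected letter.
# -*- coding: utf-8 -*-
# illustrative only: mimic how Pig treats a chararray field as UTF-8
raw = u'ù'.encode('iso-8859-1')             # the single byte '\xf9' as stored in the file
print repr(raw.decode('utf-8', 'replace'))  # u'\ufffd' -> rendered as �
print repr(raw.decode('iso-8859-1'))        # u'\xf9'   -> the expected 'ù'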
Next, we need to process each line:
parsed = foreach data generate flatten(my.testparser($0));
Then, within the UDF function (my.testparser()), we change the iso-8859-1 encoding into UTF-8:
....
val line: String = new String(input.get(0).asInstanceOf[DataByteArray].get(), "windows-1250")
....
The code is in Scala (I'm not a Python guy, sorry). It takes the input DataByteArray (bytearray in Pig) and gets an array of Scala Bytes. Those bytes are then interpreted as windows-1250.
That said, your UDF can stay the same, but the input needs to be changed to bytearray in Pig and to the equivalent data type in your UDF.
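For Python users, a hypothetical equivalent of that adaptation could look like the sketch below (untested, and it assumes, as in the Python answer above, that the bytearray field reaches the Jython UDF as a sequence of byte values):
#!/usr/bin/env python
# charsetConversionBytes.py -- hypothetical sketch, not the original poster's file
import struct
@outputSchema("word:chararray")
def toUTF8(s):
    # s is assumed to be a sequence of byte values; mask to 0..255 in case they
    # arrive as signed Java bytes, rebuild the raw byte string and interpret it
    # as iso-8859-1 (Pig writes chararrays out as UTF-8 on STORE)
    raw = ''.join([struct.pack('B', b & 0xFF) for b in s])
    return unicode(raw, 'iso-8859-1')
The LOAD statement with col1 declared as a bytearray stays exactly as shown above.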
Hope this helps
Upvotes: 1