John Deer

Reputation: 35

Difficulty with encoding while reading data in Spark

In connection with my earlier question, when I give the command,

filePath = sc.textFile("/user/cloudera/input/Hin*/datafile.txt")
filePath.collect()

some part of the data has '\xa0' prefixed to every word, while the other part doesn't have that special character. I am attaching 2 pictures, one with '\xa0' and another without it. The content shown in both pictures belongs to the same file; only some of the data from that file is read this way by Spark. I have checked the original data file in HDFS, and it is problem-free.

I feel that it has something to do with encoding. I tried using the replace option in flatMap, e.g. flatMap(lambda line: line.replace(u'\xa0', ' ').split(" ")) and flatMap(lambda line: line.replace(u'\xa0', u' ').split(" ")), but neither worked for me. This question might sound dumb, but I am a newbie to Apache Spark and need some assistance to overcome this problem. For reference, the full pipeline I attempted is sketched after this paragraph.
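Here is a minimal sketch of what I tried, using the same path as in my command above:

filePath = sc.textFile("/user/cloudera/input/Hin*/datafile.txt")
# Attempt to strip the non-breaking space before splitting each line into words
words = filePath.flatMap(lambda line: line.replace(u'\xa0', u' ').split(" "))
words.collect()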

Can anyone please help me? Thanks in advance.

Upvotes: 0

Views: 4279

Answers (1)

Steven

Reputation: 15258

Check the encoding of your file. When you use sc.textFile, Spark expects a UTF-8-encoded file. One solution is to read your file with sc.binaryFiles and then apply the expected decoding yourself.

sc.binaryFiles creates a key/value RDD where the key is the path to the file and the value is its content as bytes. If you need to keep only the text, apply a decoding function:

filePath = sc.binaryFiles("/user/cloudera/input/Hin*/datafile.txt")
texts = filePath.map(lambda x: x[1].decode('utf-8'))  # or another encoding depending on your file
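From there you can split the decoded text back into words. If you don't know the encoding up front, one option is to guess it from the raw bytes. A sketch, assuming the chardet library is installed on the driver and executors (decode_content and the other names are just for illustration):

import chardet  # optional: only needed to guess an unknown encoding

filePath = sc.binaryFiles("/user/cloudera/input/Hin*/datafile.txt")

def decode_content(kv):
    path, content = kv
    # Guess the encoding from the raw bytes, falling back to UTF-8
    guess = chardet.detect(content).get('encoding') or 'utf-8'
    return content.decode(guess)

words = filePath.map(decode_content).flatMap(lambda text: text.split())
words.collect()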

Upvotes: 2
