Reputation: 35
In connection with my earlier question: when I run
filePath = sc.textFile("/user/cloudera/input/Hin*/datafile.txt")
filePath.collect()
part of the data has '\xa0' prefixed to every word, while the rest does not have that special character. I am attaching two pictures, one with '\xa0' and one without.
Both pictures show content from the same file; Spark reads only some of that file's data this way. I have checked the original data file in HDFS, and it is problem free.
I suspect it has something to do with encoding. I tried using replace inside flatMap, both as flatMap(lambda line: line.replace(u'\xa0', ' ').split(" "))
and as flatMap(lambda line: line.replace(u'\xa0', u' ').split(" "))
, but neither worked for me. This question might sound dumb, but I am a newbie to Apache Spark and need some assistance to overcome this problem.
Can anyone please help me? Thanks in advance.
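For context, u'\xa0' is the Unicode non-breaking space (NBSP). It typically shows up when a file mixes non-breaking spaces with ordinary ones, or when bytes are decoded with a mismatched codec. A quick illustration in plain Python (the sample bytes here are made up, not taken from the actual file):

```python
# '\xa0' is the non-breaking space (NBSP). In UTF-8 it is the two-byte
# sequence b'\xc2\xa0'; in latin-1/cp1252 it is the single byte b'\xa0'.
utf8_bytes = b"word1\xc2\xa0word2"   # hypothetical sample data

decoded = utf8_bytes.decode("utf-8")
print(repr(decoded))                 # 'word1\xa0word2'

# On a properly decoded unicode string, replace() does remove the NBSP:
cleaned = decoded.replace(u"\xa0", u" ")
print(cleaned.split(" "))            # ['word1', 'word2']
```

Since replace() works on correctly decoded text, a failure inside flatMap usually points to the bytes having been decoded with the wrong encoding in the first place.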
Upvotes: 0
Views: 4279
Reputation: 15258
Check the encoding of your file. When you use sc.textFile
, Spark expects a UTF-8 encoded file.
One solution is to load your file with sc.binaryFiles
and then apply the expected decoding yourself.
sc.binaryFiles
creates a key/value RDD where the key is the path to the file and the value is its content as bytes.
If you need to keep only the text, apply a decoding function:
filePath = sc.binaryFiles("/user/cloudera/input/Hin*/datafile.txt")
filePath.map(lambda x: x[1].decode('utf-8')) # or another encoding, depending on your file
Upvotes: 2