jOasis

Reputation: 394

Spark: importing text file in UTF-8 encoding

I am trying to process a file which contains a lot of special characters, such as German umlauts (ä, ö, ü), as follows:

sc.hadoopConfiguration.set("textinputformat.record.delimiter", "\r\n\r\n")
sc.textFile("/file/path/samele_file.txt")

But upon reading the contents, these special characters are not recognized.

I think the default encoding is not UTF-8. I would like to know if there is a way to set the encoding on this textFile method, such as:

sc.textFile("/file/path/samele_file.txt", mode="utf-8")

Upvotes: 6

Views: 12085

Answers (2)

Shrey

Reputation: 98

No. If you read a non-UTF-8 file in UTF-8 mode, non-ASCII characters will not be decoded properly. Either convert the file to UTF-8 encoding before reading it, or decode the raw bytes yourself with an explicit charset. You can refer to Reading file in different formats:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat
// decode each record's raw bytes with the charset the file was written in
context.hadoopFile[LongWritable, Text, TextInputFormat](location).map(
    pair => new String(pair._2.getBytes, 0, pair._2.getLength, charset)
)
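For instance, combining this with the record delimiter from the question, a minimal sketch could look like the following. The ISO-8859-1 charset is only an assumed example, and it is also assumed that the Hadoop version in use honors textinputformat.record.delimiter for the old-API TextInputFormat:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

// record delimiter from the question
sc.hadoopConfiguration.set("textinputformat.record.delimiter", "\r\n\r\n")

// decode each record's bytes with the charset the file was actually written in
// (ISO-8859-1 here is just an assumed example)
val records = sc.hadoopFile[LongWritable, Text, TextInputFormat]("/file/path/samele_file.txt")
  .map(pair => new String(pair._2.getBytes, 0, pair._2.getLength, "ISO-8859-1"))

records.take(5).foreach(println)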

Upvotes: 2

pramesh

Reputation: 1954

The default encoding is UTF-8, so you don't need to specify it explicitly for UTF-8 files. If the file is not UTF-8, it depends on whether you need to read those unsupported characters or not.
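In other words, for a file that is already UTF-8, the call from the question works as-is; a minimal sketch of what that looks like (re-using the path from the question):

// sc.textFile uses Hadoop's TextInputFormat, which decodes lines as UTF-8,
// so umlauts in a UTF-8 file come through without any extra settings
val lines = sc.textFile("/file/path/samele_file.txt")
lines.filter(line => line.contains("ä") || line.contains("ü")).take(3).foreach(println)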

Upvotes: 1
