Reputation: 394
I am trying to process a file which contains a lot of special characters such as German umlauts (ä, ü, ö) as follows:
sc.hadoopConfiguration.set("textinputformat.record.delimiter", "\r\n\r\n")
sc.textFile("/file/path/samele_file.txt")
But upon reading the contents, these special characters are not recognized.
I think the default encoding is not UTF-8 or a similar format. I would like to know if there is a way to set the encoding on this textFile method, for example:
sc.textFile("/file/path/samele_file.txt", mode="utf-8")
Upvotes: 6
Views: 12085
Reputation: 98
No. If you read a file that is not UTF-8 encoded in UTF-8 mode, non-ASCII characters will not be decoded properly. Either convert the file to UTF-8 encoding before reading it, or decode the raw bytes with the correct charset yourself. You can refer to Reading file in different formats:
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat
// decode each record's raw bytes with the charset of your choice
sc.hadoopFile[LongWritable, Text, TextInputFormat](location).map(
  pair => new String(pair._2.getBytes, 0, pair._2.getLength, charset)
)
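For example, to read the file from the question while keeping its blank-line record delimiter, something like the sketch below should work, assuming a Hadoop version that honors textinputformat.record.delimiter for this input format. The "ISO-8859-1" charset is an assumption for illustration; substitute whatever encoding your file actually uses.

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat
// keep the custom record delimiter from the question
sc.hadoopConfiguration.set("textinputformat.record.delimiter", "\r\n\r\n")
val records = sc
  .hadoopFile[LongWritable, Text, TextInputFormat]("/file/path/samele_file.txt")
  // "ISO-8859-1" is assumed here; replace it with your file's real encoding
  .map(pair => new String(pair._2.getBytes, 0, pair._2.getLength, "ISO-8859-1"))
records.take(5).foreach(println)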
Upvotes: 2
Reputation: 1954
The default encoding is UTF-8, so you don't need to specify it explicitly for UTF-8 files. If the file is in a non-UTF-8 encoding, it depends on whether you actually need to read those unsupported characters or not.
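A quick way to check this (a minimal sketch, reusing the asker's path): sc.textFile decodes each record as UTF-8 and substitutes the Unicode replacement character U+FFFD for bytes it cannot decode, so seeing that character means the file is not valid UTF-8 and needs the bytes-based approach above.

// sc.textFile decodes as UTF-8; undecodable bytes become U+FFFD ("�")
val firstLine = sc.textFile("/file/path/samele_file.txt").first()
println(firstLine.contains("\uFFFD")) // true => the file is not valid UTF-8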
Upvotes: 1