spark-xml problem with encoding windows-1251

I have a problem parsing an XML document in PySpark using the spark-xml API (PySpark 2.4.0). I have a file with Cyrillic content that starts with the following declaration:

<?xml version='1.0' encoding='WINDOWS-1251'?>

When I open it in a text editor with windows-1251 encoding, I can see the Cyrillic text. But when I try to read the XML file into a Spark dataframe using this command:

df = spark.read.format('xml').options(rootTag='products', rowTag='product', charset="cp1251").load('./cp1251')

I see strange symbols instead of Cyrillic. But when I convert the file to UTF-8, the same command works correctly and I get a Spark dataframe with Cyrillic text. I find this strange, because the declaration specifies cp1251 and the content really is written in that encoding, yet it is only read correctly when interpreted as UTF-8.

My question: does anyone have an idea why this happens? Is there a way to read a cp1251 XML file directly into a dataframe without converting it to UTF-8? If not, I would like to know the reason for this behaviour.

P. S. In case it matters, the XML file is nested; the Cyrillic content sits several levels deep inside the product block.

P. P. S. If I give no encoding information in the DataFrameReader options, it reads the file as UTF-8, ignoring the encoding declared in the opening tag. That also seems strange; if you know anything about it, please share.

Upvotes: 0

Views: 474

Answers (1)

Well, in fact it was necessary to use UTF-8 encoding, i.e. to work with a UTF-8 document. The reason non-UTF encodings do not work lies in Hadoop's Text class: it only supports UTF-8 text (see the Hadoop Text docs).

So, if you work with spark-xml, you first need to convert your file to UTF-8.
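A minimal conversion sketch, assuming the file is small enough to re-encode on the driver; the input path ./cp1251 comes from the question, while ./utf8.xml is a hypothetical output path:

import io

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Re-encode the source file from windows-1251 to UTF-8 on the driver.
with io.open('./cp1251', 'r', encoding='cp1251') as src, \
     io.open('./utf8.xml', 'w', encoding='utf-8') as dst:
    for line in src:
        # Rewrite the XML declaration so it no longer claims WINDOWS-1251.
        dst.write(line.replace("encoding='WINDOWS-1251'", "encoding='UTF-8'"))

# spark-xml reads the UTF-8 copy correctly, no charset option needed.
df = (spark.read.format('xml')
      .options(rootTag='products', rowTag='product')
      .load('./utf8.xml'))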

Upvotes: 0
