spark-xml problem with encoding windows-1251

I have a problem parsing an XML document in PySpark using the spark-xml API (PySpark 2.4.0). I have a file with Cyrillic content that starts with the following declaration:

<?xml version='1.0' encoding='WINDOWS-1251'?>

When I open it in a text editor with windows-1251 encoding, I can see the Cyrillic text. But when I try to read the XML file into a Spark dataframe using this command:

df = spark.read.format('xml').options(rootTag='products', rowTag='product', charset="cp1251").load('./cp1251')

I see strange symbols instead of Cyrillic. But when I convert the file to UTF-8, the same command works correctly and I get a Spark dataframe with Cyrillic text. I find this strange, because the declaration specifies cp1251 and the content really is written in that encoding, yet it is only read correctly when interpreted as UTF-8.

My question: does anyone have an idea why this happens? Is there a way to read a cp1251 XML file directly into a dataframe without converting it to UTF-8? If not, I would like to know the reason for this behaviour.

P. S. In case it matters, the XML file is nested; the Cyrillic content sits several levels deep inside the product block.

P. P. S. If I give no encoding information in the DataFrameReader options, it reads the file as UTF-8, ignoring the encoding declared in the opening tag. That also seems strange; if you know anything about it, please share.

Upvotes: 0

Views: 474

Answers (1)

Well, in fact it was necessary to use UTF-8 encoding, i.e. to work with a UTF-8 document. The reason non-UTF encodings do not work lies in Hadoop's Text class: it only supports UTF-8 text (see the Hadoop Text docs).

So, if you work with spark-xml, you first need to convert your file to UTF-8.
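A minimal conversion sketch, assuming the file is small enough to re-encode on the driver; the input path ./cp1251 comes from the question, while ./utf8.xml is a hypothetical output path:

import io

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Re-encode the source file from windows-1251 to UTF-8 on the driver.
with io.open('./cp1251', 'r', encoding='cp1251') as src, \
     io.open('./utf8.xml', 'w', encoding='utf-8') as dst:
    for line in src:
        # Rewrite the XML declaration so it no longer claims WINDOWS-1251.
        dst.write(line.replace("encoding='WINDOWS-1251'", "encoding='UTF-8'"))

# spark-xml reads the UTF-8 copy correctly, no charset option needed.
df = (spark.read.format('xml')
      .options(rootTag='products', rowTag='product')
      .load('./utf8.xml'))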

Upvotes: 0
