Noxos

Reputation: 21

XML file from ISO-8859-2 to UTF-8 in Python

I need help resolving what seems to be an encoding issue.

I have many input files that follow the same pattern as the one below:

<?xml version='1.0' encoding='iso-8859-1'?>
  <root>
    <Module name="ModuleName">
      <Parameter Value="Data01$|Data02F1F5$|Data03:$|Data04 : $|"/>
    </Module>
  </root>

I need to be able to parse the file, but it contains many special characters, as you can see below:

[Screenshot of the special characters; image not available]

I can't use lxml or Beautiful Soup.

I tried the options below, but none of them solved the problem:

from xml.etree import ElementTree

file = 'StackOverflow.xml'

# Attempt 1: read the text as ISO-8859-1, then rewrite the same file as UTF-8
with open(file, 'r', encoding='iso-8859-1') as f:
    string = f.read()
    print(string)
with open(file, 'w', encoding='utf-8') as f:
    f.write(string)

# Attempt 2: parse the raw bytes, then write the tree back as UTF-8
with open(file, 'rb') as f:
    root = ElementTree.fromstring(f.read())

tree = ElementTree.ElementTree(root)
tree.write(file, encoding='utf-8', xml_declaration=True)

# Attempt 3: force the parser encoding, serialize as UTF-8, re-encode as ISO-8859-1
with open(file, 'rb') as f:
    parser = ElementTree.XMLParser(encoding='iso-8859-1')
    tree = ElementTree.parse(f, parser)

string = ElementTree.tostring(tree.getroot(), xml_declaration=True, encoding='utf-8').decode('utf-8').encode('iso-8859-1')

with open(file, 'wb') as f:
    f.write(string)

Upvotes: -1

Views: 113

Answers (1)

Hermann12

Reputation: 3591

I can't reproduce your problem:

import xml.etree.ElementTree as ET

xml_file_path = "StackOverFlow.xml"

tree = ET.parse(xml_file_path)
root = tree.getroot()

for elem in root.iter():
    print(elem.tag, elem.attrib)

Output:

root {}
Module {'name': 'ModuleName'}
Parameter {'Value': 'Data01$|Data02F1F5$|Data03:$|Data04 : $|'}

Your screenshot says utf-8, so I tried that as well:

import xml.etree.ElementTree as ET

# Even with utf-8 it works:
xml_str = """<?xml version='1.0' encoding='utf-8'?>
<root>
  <Module name="ModuleName">
    <Parameter Value="Data01$|Data02F1F5$|Data03:$|Data04 : $|" />
  </Module>
</root>"""

root = ET.fromstring(xml_str)

for elem in root.iter():
    print(elem.tag, elem.attrib)

The output is the same:

root {}
Module {'name': 'ModuleName'}
Parameter {'Value': 'Data01$|Data02F1F5$|Data03:$|Data04 : $|'}
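
If the goal is only to convert the files, here is a minimal sketch. It assumes the bytes on disk really match the encoding named in each file's XML declaration. The function name `convert_to_utf8` and the two path arguments are hypothetical, not something from your question. ET.parse reads the raw bytes and decodes them according to the declaration, and tree.write emits a new declaration that matches the output encoding.

```python
import xml.etree.ElementTree as ET


def convert_to_utf8(src_path, dst_path):
    """Re-serialize an XML file as UTF-8 (hypothetical helper)."""
    # Passing a path lets the parser read raw bytes and honour the
    # encoding named in the XML declaration of the source file.
    tree = ET.parse(src_path)
    # write() produces UTF-8 bytes and a declaration that says so.
    tree.write(dst_path, encoding="utf-8", xml_declaration=True)
```

Writing to a separate output path keeps the original untouched. Rewriting the file as UTF-8 text while the declaration still says iso-8859-1, as in your first attempt, may be what leads to wrongly decoded characters when the file is parsed afterwards, but I can't confirm that without the real data.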

Upvotes: 0
