desperatecoder

Reputation: 187

Python: read an XML file with Chinese characters

This is my example XML file:

<ROOT><RECORD><設立案號>066143470</設立案號><登記編號>4927872</登記編號><工廠名稱>公司名稱</工廠名稱><工廠地址>工廠地址</工廠地址></RECORD></ROOT>

The problem I'm facing is after I read it into BeautifulSoup:

soup = BeautifulSoup (open("info.xml"), features="lxml")
page = soup.html.root
print(page.prettify())

The result I got is:

<root<record>066143470\u8a2d\u7acb\u6848\u865f&gt;4927872\u767b\u8a18\u7de8\u865f&gt;\u516c\u53f8\u540d\u7a31\u5de5\u5ee0\u540d\u7a31&gt;\u5de5\u5ee0\u5730\u5740\u5de5\u5ee0\u5730\u5740&gt;</record></root>

Basically, the structure of the file is badly mangled: the Chinese tag names are turned into escaped text. How can I read in the file with all the Chinese characters and the structure preserved?

Thanks in advance.

Upvotes: 1

Views: 262

Answers (1)

Andrej Kesely

Reputation: 195428

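Use the xml parser, not lxml. With features="lxml", BeautifulSoup parses the document as HTML, which is why the Chinese tag names end up as text; "xml" selects the XML parser instead: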

from bs4 import BeautifulSoup

txt = '''<ROOT><RECORD><設立案號>066143470</設立案號><登記編號>4927872</登記編號><工廠名稱>公司名稱</工廠名稱><工廠地址>工廠地址</工廠地址></RECORD></ROOT>'''

# 'xml' selects the XML parser, which keeps the non-ASCII tag names intact
soup = BeautifulSoup(txt, 'xml')
print(soup.prettify())

Prints:

<?xml version="1.0" encoding="utf-8"?>
<ROOT>
 <RECORD>
  <設立案號>
   066143470
  </設立案號>
  <登記編號>
   4927872
  </登記編號>
  <工廠名稱>
   公司名稱
  </工廠名稱>
  <工廠地址>
   工廠地址
  </工廠地址>
 </RECORD>
</ROOT>
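
If you only need the values and not BeautifulSoup itself, the standard library's xml.etree.ElementTree may also work for reading the file directly. This is a minimal sketch, not part of the answer above: it assumes the file is UTF-8 (or declares its encoding in an XML declaration) and follows the ROOT/RECORD layout from the question. The helper name read_records is hypothetical.

```python
import xml.etree.ElementTree as ET


def read_records(path):
    """Parse an XML file and return each RECORD as a dict of tag name -> text.

    Hypothetical helper: assumes the ROOT/RECORD layout from the question.
    """
    # ET.parse reads the raw bytes itself, so the encoding comes from the
    # XML declaration if there is one (UTF-8 is assumed otherwise).
    root = ET.parse(path).getroot()
    # Chinese tag names are valid XML names and come back as ordinary str.
    return [
        {child.tag: child.text for child in record}
        for record in root.iter("RECORD")
    ]
```

Calling read_records("info.xml") on the file from the question should give a list with one dict whose keys are the Chinese tag names, for example record["工廠名稱"] for the factory name.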

Upvotes: 1
