Reputation: 73
I am parsing an XML file that contains HTML entities for special characters in some of the author names (&iacute; = í, &iuml; = ï, &ograve; = ò, etc.). My code fails with the error "ExpatError: undefined entity" when it tries to process these characters. I have seen the BeautifulSoup library mentioned online, but I am unsure how to integrate it into my code without rewriting everything using the lxml library (if my understanding is correct). What is the best way to solve this? Cheers!
XML data to load
<pub>
  <ID>75</ID>
  <title>Use of Lexicon Density in Evaluating Word Recognizers</title>
  <year>2000</year>
  <booktitle>Multiple Classifier Systems</booktitle>
  <pages>310-319</pages>
  <authors>
    <author>Petr Slav&iacute;k</author>
    <author>Venu Govindaraju</author>
  </authors>
</pub>
Python code
import sqlite3
from xml.dom import minidom

con = sqlite3.connect("publications.db")
cur = con.cursor()

xmldoc = minidom.parse("test.xml")

# loop through <pub> tags to find number of pubs to grab
root = xmldoc.getElementsByTagName("root")[0]
pubs = root.getElementsByTagName("pub")
num_pubs = len(pubs)

count = 0
while count < num_pubs:
    # get data from each <pub> tag
    temp_pub = pubs[count]
    temp_ID = temp_pub.getElementsByTagName("ID")[0].firstChild.data
    temp_title = temp_pub.getElementsByTagName("title")[0].firstChild.data
    temp_year = temp_pub.getElementsByTagName("year")[0].firstChild.data
    temp_booktitle = temp_pub.getElementsByTagName("booktitle")[0].firstChild.data
    temp_pages = temp_pub.getElementsByTagName("pages")[0].firstChild.data
    temp_authors = temp_pub.getElementsByTagName("authors")[0]
    temp_author_array = [a.firstChild.data for a in temp_authors.getElementsByTagName("author")]
    num_authors = len(temp_author_array)
    count = count + 1

    # process results into sqlite
    pub_params = (temp_ID, temp_title)
    cur.execute("INSERT INTO publication (id, ptitle) VALUES (?, ?)", pub_params)
    journal_params = (temp_booktitle, temp_pages, temp_year)
    cur.execute("INSERT INTO journal (jtitle, pages, year) VALUES (?, ?, ?)", journal_params)
    x = 0
    while x < num_authors:
        cur.execute("INSERT OR IGNORE INTO authors (name) VALUES (?)", (temp_author_array[x],))
        x = x + 1

    # display results
    print("\nEntry processed: ", count)
    print("------------------\nPublication ID: ", temp_ID)
    print("Publication Title: ", temp_title)
    print("Year: ", temp_year)
    print("Journal title: ", temp_booktitle)
    print("Pages: ", temp_pages)
    i = 0
    print("Authors: ")
    while i < num_authors:
        print("-", temp_author_array[i])
        i = i + 1

con.commit()
con.close()
print("\nNumber of entries processed: ", count)
Upvotes: 0
Views: 2144
Reputation: 3271
The trick is to use html.unescape to convert the HTML5 entities to their Unicode characters, and then escape the XML syntax characters (&, <, >) back so that the standard XML parser can read them as text.
#! /usr/bin/python3
import re
import xml.dom.minidom
from html import escape, unescape
def minidom_parseHtml(text: str):
    "parse html text with non-xml html-entities as minidom"
    textXML = re.sub(r"&\w+;", lambda x: escape(unescape(x.group(0))), text)
    return xml.dom.minidom.parseString(textXML)
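For example, running the helper on the author fragment from the question (the one-line XML string here is assumed for illustration; the helper is repeated so the snippet runs on its own) shows the round trip: `&iacute;` is resolved to `í`, while a markup-significant entity like `&amp;` would be re-escaped so the XML stays well-formed.

```python
import re
import xml.dom.minidom
from html import escape, unescape

def minidom_parseHtml(text: str):
    "parse html text with non-xml html-entities as minidom"
    textXML = re.sub(r"&\w+;", lambda x: escape(unescape(x.group(0))), text)
    return xml.dom.minidom.parseString(textXML)

# &iacute; is not one of XML's five predefined entities, so plain minidom
# would raise ExpatError; unescape() resolves it to "í" and escape()
# re-encodes only &, < and >, leaving valid XML for the parser.
doc = minidom_parseHtml("<author>Petr Slav&iacute;k</author>")
print(doc.getElementsByTagName("author")[0].firstChild.data)  # → Petr Slavík
```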
Upvotes: 0
Reputation: 2688
UTF-8 has support for most of these characters, so the following should work. Add `.encode('utf-8')` to your code after parsing:
xmldoc = minidom.parse("test.xml")
NewXML = xmldoc.encode('utf-8', 'ignore')
Upvotes: 1
Reputation: 1701
You may decode the extracted data first, simply by importing html (if you are using Python 3.x). From the documentation, html.unescape(s):
Convert all named and numeric character references (e.g. &gt;, &#62;, &#x3e;) in the string s to the corresponding Unicode characters.
>>> import html
>>> print(html.unescape("Petr Slav&iacute;k"))
Petr Slavík
It seems these HTML entities cannot be parsed into a Document object by minidom directly, so you have to read the file, unescape it, and then pass it as a string to the module, as in the following code. minidom.parseString then returns a Document that represents the string.
import html
from xml.dom import minidom

file_text = html.unescape(open('text.xml', 'r').read())
xmldoc = minidom.parseString(file_text)
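A quick end-to-end check of this approach (the XML string below is assumed for illustration). Note one caveat: html.unescape also converts &amp;, &lt; and &gt;, so unescaping the whole file is only safe when the document contains no markup-significant entities in its text content.

```python
import html
from xml.dom import minidom

# simulate reading a file whose author names use HTML entities
raw = "<authors><author>Petr Slav&iacute;k</author><author>Venu Govindaraju</author></authors>"

# unescape first, then hand the plain string to minidom
doc = minidom.parseString(html.unescape(raw))
names = [a.firstChild.data for a in doc.getElementsByTagName("author")]
print(names)  # → ['Petr Slavík', 'Venu Govindaraju']
```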
Upvotes: 1