douglasrcjames_old

Reputation: 73

How to read in special characters to Python

I am parsing an XML file that has special characters from foreign languages in some of the author names (&iacute; = í, &iuml; = ï, &ograve; = ò, etc.). My code fails with the error "ExpatError: undefined entity:" when it tries to process these characters. I have seen the BeautifulSoup library online, but I am unsure how to use it in my code without having to rewrite everything using the lxml library (if my understanding is correct). What is the best way to solve this? Cheers!

XML data to load

<pub>
    <ID>75</ID>
    <title>Use of Lexicon Density in Evaluating Word Recognizers</title>
    <year>2000</year>
    <booktitle>Multiple Classifier Systems</booktitle>
    <pages>310-319</pages>
    <authors>
        <author>Petr Slav&iacute;k</author>
        <author>Venu Govindaraju</author>
    </authors>
</pub>

Python code

import sqlite3
con = sqlite3.connect("publications.db")
cur = con.cursor()

from xml.dom import minidom

xmldoc = minidom.parse("test.xml")

#loop through <pub> tags to find number of pubs to grab
root = xmldoc.getElementsByTagName("root")[0]
pubs = [a.firstChild.data for a in root.getElementsByTagName("pub")]
num_pubs = len(pubs)
count = 0

while(count < num_pubs):

    #get data from each <pub> tag
    temp_pub = root.getElementsByTagName("pub")[count]
    temp_ID = temp_pub.getElementsByTagName("ID")[0].firstChild.data
    temp_title = temp_pub.getElementsByTagName("title")[0].firstChild.data
    temp_year = temp_pub.getElementsByTagName("year")[0].firstChild.data
    temp_booktitle = temp_pub.getElementsByTagName("booktitle")[0].firstChild.data
    temp_pages = temp_pub.getElementsByTagName("pages")[0].firstChild.data
    temp_authors = temp_pub.getElementsByTagName("authors")[0]
    temp_author_array = [a.firstChild.data for a in temp_authors.getElementsByTagName("author")]
    num_authors = len(temp_author_array)
    count = count + 1


    #process results into sqlite
    pub_params = (temp_ID, temp_title)
    cur.execute("INSERT INTO publication (id, ptitle) VALUES (?, ?)", pub_params)
    journal_params = (temp_booktitle, temp_pages, temp_year)
    cur.execute("INSERT INTO journal (jtitle, pages, year) VALUES (?, ?, ?)", journal_params)
    x = 0
    while(x < num_authors):
        cur.execute("INSERT OR IGNORE INTO authors (name) VALUES (?)", (temp_author_array[x],))
        x = x + 1

    #display results
    print("\nEntry processed: ", count)
    print("------------------\nPublication ID: ", temp_ID)
    print("Publication Title: ", temp_title)
    print("Year: ", temp_year)
    print("Journal title: ", temp_booktitle)
    print("Pages: ", temp_pages)
    i = 0
    print("Authors: ")
    while(i < num_authors):
        print("-",temp_author_array[i])
        i = i + 1

con.commit()
con.close()    

print("\nNumber of entries processed: ", count)  

Upvotes: 0

Views: 2144

Answers (3)

Guido U. Draheim

Reputation: 3271

The trick is to use html.unescape to convert the HTML5 entities to their Unicode characters and then escape the XML syntax characters again, so that the standard XML parser can read them as text.

#! /usr/bin/python3
import re
import xml.dom.minidom
from html import escape, unescape

def minidom_parseHtml(text: str):
    """Parse text containing non-XML HTML entities with minidom."""
    # Decode each named entity, then re-escape the result so that XML syntax
    # characters (&, <, >) stay escaped while e.g. &iacute; becomes í.
    textXML = re.sub(r"&\w+;", lambda x: escape(unescape(x.group(0))), text)
    return xml.dom.minidom.parseString(textXML)
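
As a quick check, here is a minimal sketch of the function applied to a cut-down version of the question's data. The sample string and the second author entry containing &amp; are made up for illustration, and the function is repeated so the snippet runs on its own.

```python
import re
import xml.dom.minidom
from html import escape, unescape

def minidom_parseHtml(text: str):
    """Parse text containing non-XML HTML entities with minidom."""
    text_xml = re.sub(r"&\w+;", lambda m: escape(unescape(m.group(0))), text)
    return xml.dom.minidom.parseString(text_xml)

# Hypothetical sample: one HTML-only entity (&iacute;) and one XML entity (&amp;).
sample = (
    "<root><pub><authors>"
    "<author>Petr Slav&iacute;k</author>"
    "<author>A &amp; B</author>"
    "</authors></pub></root>"
)

doc = minidom_parseHtml(sample)
authors = [a.firstChild.data for a in doc.getElementsByTagName("author")]
print(authors)  # ['Petr Slavík', 'A & B']
```

Note that &amp; survives the round trip because it is escaped again before parsing, so the parser still sees well-formed XML.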

Upvotes: 0

innicoder

Reputation: 2688

.encode('UTF-8') #Add to your code at the end of the example

UTF-8 supports most of these characters, so the following should work. Add:

xmldoc = minidom.parse("test.xml")
NewXML = xmldoc.encode('utf-8', 'ignore')

Upvotes: 1

M. Leung

Reputation: 1701

You can decode the extracted data first by simply importing html, if you are using Python 3.x:

html.unescape(s)

Convert all named and numeric character references (e.g. &gt;, &#62;, &#x3e;) in the string s to the corresponding Unicode characters.

>>> import html
>>> print(html.unescape("Petr Slav&iacute;k"))

Petr Slavík

It seems minidom cannot parse the HTML entities and return a Document object directly, so you have to read the file, decode it, and then pass the resulting string to the module, as in the following code.

xml.dom.minidom.parseString(string[, parser])

Return a Document that represents the string.

with open('test.xml', 'r') as f:
    file_text = html.unescape(f.read())
xmldoc = minidom.parseString(file_text)
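
One caveat worth checking before relying on this: html.unescape applied to the whole file also decodes &amp; and &lt;, which XML needs to stay escaped. Here is a minimal sketch of where that may bite, using made-up sample strings rather than the question's actual file:

```python
import html
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical input with only an HTML entity: unescaping the whole text works.
ok = "<root><author>Petr Slav&iacute;k</author></root>"
doc = minidom.parseString(html.unescape(ok))
name = doc.getElementsByTagName("author")[0].firstChild.data

# Hypothetical input that also contains &amp;: unescaping turns it into a
# bare "&", which is no longer well-formed XML.
risky = "<root><title>Classifiers &amp; Recognizers</title></root>"
try:
    minidom.parseString(html.unescape(risky))
    parse_failed = False
except ExpatError:
    parse_failed = True

print(name, parse_failed)
```

If your data never contains &amp;, &lt; or &gt;, unescaping the whole file should be fine; otherwise it is safer to decode only the entities that XML does not already define.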

Upvotes: 1
