Reputation: 131
I receive XML responses via SRU in Dublin Core format. There are several records; this is one example:
<dc:title>Die EU im Einsatz gegen den Klimawandel : der EU-Emissionshandel - ein offenes System, das weltweit Innovationen fördert / [Europäische Kommission]</dc:title>
<dc:creator>Europäische Kommission</dc:creator>
<dc:publisher>[Luxemburg] : [Amt für Amtliche Veröff. der Europ. Gemeinschaften]</dc:publisher>
<dc:date>2005</dc:date>
<dc:language>ger</dc:language>
<dc:identifier xmlns:tel="http://krait.kb.nl/coop/tel/handbook/telterms.html" xsi:type="tel:ISBN">92-894-9187-6 geh.</dc:identifier>
<dc:identifier xsi:type="dnb:IDN">992017882</dc:identifier>
<dc:subject>360 Soziale Probleme, Sozialdienste, Versicherungen</dc:subject>
<dc:subject>330 Wirtschaft</dc:subject>
<dc:format>20 S.</dc:format>
</dc></recorddata><recordposition>3</recordposition></record>
I am trying to select the element <dc:identifier xsi:type="dnb:IDN">992017882</dc:identifier>, but I cannot get it to work. The records contain a varying number of dc:identifier elements (some have one, some two, some three or more), so I use a function to extract the content of the XML tags I need and then load the results into a dataframe. This works well for elements such as dc:title, but as soon as I also need to match on an attribute, I am at a loss. I have tried various things, and I suspect the problem is that I need to address two namespaces. The current function looks like this:
def parse_record(record):
    ns = {"dc": "http://purl.org/dc/elements/1.1/"}
    xml = ET.fromstring(unicodedata.normalize("NFC", str(record)))
    #idn
    idn = xml.xpath(".//dc:identifier[@xsi:type='dnb:IDN']", namespaces=ns)
    try:
        idn = idn.text
    except:
        idn = 'fail'
    # titel
    titel = xml.xpath('.//dc:title', namespaces=ns)
    try:
        titel = titel[0].text
        #titel = unicodedata.normalize("NFC", titel)
    except:
        titel = "unkown"
    meta_dict = {"idn":idn, "titel":titel}
    return meta_dict
I can run the function without any problems, but when I try to parse the response into a dataframe with the following code:
output = [parse_record(record) for record in records]
df = pd.DataFrame(output)
df
I get the error message: "XPathEvalError: Undefined namespace prefix"
Can anyone help?
Reputation: 4462
As pointed out in the comments, the dictionary of namespace declarations must also define the xsi prefix, because the XPath expression uses it in @xsi:type:
ns = {
    "dc": "http://purl.org/dc/elements/1.1/",
    # adjust this URI if the response declares a different namespace for xsi
    "xsi": "http://www.w3.org/2001/XMLSchema-instance"
}
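For illustration, here is a minimal, self-contained sketch of how the lookup might work once both prefixes are mapped. It makes a few assumptions that are not from the question: the sample record is a shortened, hypothetical stand-in for the real SRU response, and it uses the standard library's xml.etree.ElementTree with findall instead of lxml's xpath so that it runs without third-party packages. The same mapping should be usable as namespaces=ns in the lxml call. Note that the query returns a list, so the first match is taken with [0] rather than calling .text on the list itself.

```python
import xml.etree.ElementTree as ET

# Both prefixes used in the path expression must be mapped to their URIs.
NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "xsi": "http://www.w3.org/2001/XMLSchema-instance",
}

# Hypothetical, shortened record; the real SRU response has more fields.
SAMPLE = """<record xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <dc:title>Die EU im Einsatz gegen den Klimawandel</dc:title>
  <dc:identifier xsi:type="tel:ISBN">92-894-9187-6 geh.</dc:identifier>
  <dc:identifier xsi:type="dnb:IDN">992017882</dc:identifier>
</record>"""


def parse_record(record_xml):
    """Return the IDN and title of one record, with fallbacks if missing."""
    root = ET.fromstring(record_xml)
    # 'dnb:IDN' is a plain attribute value here, so 'dnb' needs no mapping.
    idn_nodes = root.findall(".//dc:identifier[@xsi:type='dnb:IDN']", NS)
    title_nodes = root.findall(".//dc:title", NS)
    return {
        "idn": idn_nodes[0].text if idn_nodes else "fail",
        "titel": title_nodes[0].text if title_nodes else "unknown",
    }


result = parse_record(SAMPLE)
print(result)
```

Checking whether the result list is empty replaces the bare try/except, so a record without a dnb:IDN identifier yields the fallback value while genuine errors, such as an undefined prefix, still surface.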