Reputation: 103
I have an XML file that I want to extract data from. I tried using Python, but the example script I found online doesn't extract the data I want. The script is from https://medium.com/analytics-vidhya/parsing-xml-files-in-python-d7c136bb9aa5 and it works on the example given there, but I can't get it to work on my own XML. Here is my XML file: https://pastebin.com/Q4HTYacM (it is ~200 lines long, which is why I put it on Pastebin).
The data I'm interested in can be found in the
<datafield tag="100" ind1="1" ind2=" "> <!--VerfasserIn-->
<subfield code="a">Ullenboom, Christian</subfield>
<subfield code="e">VerfasserIn</subfield>
<subfield code="0">(DE-588)123404738</subfield>
<subfield code="0">(DE-627)502584122</subfield>
<subfield code="0">(DE-576)184619254</subfield>
<subfield code="4">aut</subfield>
</datafield>
field, as well as some others. The problem is that I'm interested in <subfield code="a">Ullenboom, Christian</subfield>
but I can't extract it: it seems that root=tree.getroot()
only treats the first line as searchable, and I haven't found a way to search for specific datafields.
Any help is appreciated.
Edit: My Script:
## source: https://medium.com/analytics-vidhya/parsing-xml-files-in-python-d7c136bb9aa5
# import libs
import pandas as pd
import numpy as np
import glob
import xml.etree.cElementTree as et
#parse the file
tree=et.parse(glob.glob('./**/*baselinexml.xml',recursive=True)[0])
root=tree.getroot()
#create list for values
creator = []
titlebook = []
VerfasserIn = []
# Converting the data
for creator in root.iter('datafield tag="100" '):
    print(creator)
    print("step1")
    creator.append(VerfasserIn)
for titlebook in root.iter('datafield tag="245" ind1="1" ind2="0"'):
    print(titlebook.text)
# creating dataframe
Jobs_df = pd.DataFrame(
    list(zip(creator, titlebook)),
    columns=['creator','titlebook'])
#saving as .csv
Jobs_df.to_csv("sample-api1.csv")
I'm fairly new to this kind of programming, so I tried modifying the code from the example.
Upvotes: 1
Views: 6230
Reputation: 41127
See [Python.Docs]: xml.etree.ElementTree - The ElementTree XML API; it covers everything you need for this task.
The XML you posted on PasteBin is incomplete (and therefore invalid): it is missing the closing tags </zs:recordData></zs:record></zs:records></zs:searchRetrieveResponse>
at the end.
The presence of namespaces complicates things, because the (real) node tags are not the literal strings from the file. Pay particular attention to the namespaces and XPath sections of the documentation referenced above.
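To make the namespace point concrete, here is a minimal sketch. The inline SAMPLE string is a trimmed stand-in for the posted file (an assumption: it uses the same two namespace URIs that appear in code00.py); it is not the real data.

```python
from xml.etree import ElementTree as ET

# Trimmed-down stand-in for the posted file (assumption: same namespace layout).
SAMPLE = """<zs:searchRetrieveResponse xmlns:zs="http://www.loc.gov/zing/srw/">
  <zs:records><zs:record><zs:recordData>
    <record xmlns="http://www.loc.gov/MARC21/slim">
      <datafield tag="100" ind1="1" ind2=" ">
        <subfield code="a">Ullenboom, Christian</subfield>
      </datafield>
    </record>
  </zs:recordData></zs:record></zs:records>
</zs:searchRetrieveResponse>"""

root = ET.fromstring(SAMPLE)

# ElementTree stores tags as "{namespace-uri}localname", not as written in the file.
tags = [elem.tag for elem in root.iter()]
print(tags[-2])  # the datafield element's real tag

# Searching for the bare name finds nothing...
plain_hits = list(root.iter("datafield"))
# ...while the fully qualified name matches.
qualified_hits = list(root.iter("{http://www.loc.gov/MARC21/slim}datafield"))
print(len(plain_hits), len(qualified_hits))
```

This is also why a call such as root.iter('datafield tag="100" ') matches nothing: iter() compares its argument against the full tag name only, and attributes have to be checked separately (or through an XPath predicate).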
Here's a variant.
code00.py:
#!/usr/bin/env python

import sys
from xml.etree import ElementTree as ET


def main(*argv):
    doc = ET.parse("./blob.xml")  # Saved (and corrected) the file from PasteBin
    root = doc.getroot()
    namespaces = {  # Manually extracted from the XML file, but there could be code written to automatically do that.
        "zs": "http://www.loc.gov/zing/srw/",
        "": "http://www.loc.gov/MARC21/slim",
    }
    #print(root)
    datafield_nodes_path = "./zs:records/zs:record/zs:recordData/record/datafield"  # XPath
    datafield_attribute_filters = [
        {
            "tag": "100",
            "ind1": "1",
            "ind2": " ",
        },
        {
            "tag": "245",
            "ind1": "1",
            "ind2": "0",
        },
    ]
    #datafield_attribute_filters = []  # Uncomment this line to clear filters (and process each datafield node)
    ret = []
    for datafield_node in root.iterfind(datafield_nodes_path, namespaces=namespaces):
        if datafield_attribute_filters:
            skip_node = True
            for attr_dict in datafield_attribute_filters:
                for k, v in attr_dict.items():
                    if datafield_node.get(k) != v:
                        break
                else:
                    # Inner loop finished without break: every attribute matched
                    skip_node = False
                    break
            if skip_node:
                continue
        for subfield_node in datafield_node.iterfind("./subfield[@code='a']", namespaces=namespaces):
            ret.append(subfield_node.text)
    print("Results:")
    for i, e in enumerate(ret, start=1):
        print("{:2d}: {:s}".format(i, e))


if __name__ == "__main__":
    print("Python {:s} {:03d}bit on {:s}\n".format(" ".join(elem.strip() for elem in sys.version.split("\n")),
                                                   64 if sys.maxsize > 0x100000000 else 32, sys.platform))
    rc = main(*sys.argv[1:])
    print("\nDone.")
    sys.exit(rc)
Output:
[cfati@CFATI-5510-0:e:\Work\Dev\StackOverflow\q071724477]> "e:\Work\Dev\VEnvs\py_pc064_03.09_test0\Scripts\python.exe" code00.py
Python 3.9.9 (tags/v3.9.9:ccb0e6a, Nov 15 2021, 18:08:50) [MSC v.1929 64 bit (AMD64)] 064bit on win32

Results:
 1: Ullenboom, Christian
 2: Java ist auch eine Insel

Done.
In the above example, I extracted the text of every subfield node whose code attribute is a and whose parent datafield node has attributes matching one of the entries in the datafield_attribute_filters list (I took the attribute filters from your script).
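If filtering on the tag attribute alone is enough, the same selection can probably be expressed more compactly with attribute predicates in the path itself. The following is only a sketch: SAMPLE is an inline stand-in for the record part of the file, the marc prefix is an arbitrary name chosen here, and subfield_texts is a hypothetical helper, not part of code00.py. Unlike the filters above, it does not check ind1 and ind2.

```python
from xml.etree import ElementTree as ET

# Inline stand-in for the <record> part of the posted file (assumption).
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="100" ind1="1" ind2=" ">
    <subfield code="a">Ullenboom, Christian</subfield>
    <subfield code="4">aut</subfield>
  </datafield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">Java ist auch eine Insel</subfield>
  </datafield>
</record>"""

# The prefix name "marc" is arbitrary; only the URI has to match the file.
NS = {"marc": "http://www.loc.gov/MARC21/slim"}


def subfield_texts(root, tag, code="a"):
    """Hypothetical helper: text of subfields with the given code under datafields with the given tag."""
    path = ".//marc:datafield[@tag='{}']/marc:subfield[@code='{}']".format(tag, code)
    return [node.text for node in root.iterfind(path, namespaces=NS)]


root = ET.fromstring(SAMPLE)
authors = subfield_texts(root, "100")
titles = subfield_texts(root, "245")
print(authors, titles)
```

The two resulting lists could then be zipped into rows for a CSV file, which is what the script in the question was trying to do.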
You can do some more filtering if you need to.
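As for the namespaces dictionary, which was written by hand in code00.py: one possible way to build it automatically is to listen for start-ns events while parsing. This is a rough sketch; collect_namespaces is a hypothetical helper and SAMPLE is a shortened stand-in for the real file. Note that an empty-string key for the default namespace is only understood by find / iterfind in Python 3.8 and later.

```python
import io
from xml.etree import ElementTree as ET

# Shortened stand-in for the posted file (assumption: same namespace declarations).
SAMPLE = b"""<zs:searchRetrieveResponse xmlns:zs="http://www.loc.gov/zing/srw/">
  <zs:records><zs:record><zs:recordData>
    <record xmlns="http://www.loc.gov/MARC21/slim"/>
  </zs:recordData></zs:record></zs:records>
</zs:searchRetrieveResponse>"""


def collect_namespaces(source):
    """Hypothetical helper: map each declared prefix to its URI ("" is the default namespace)."""
    namespaces = {}
    # For "start-ns" events, iterparse yields a (prefix, uri) tuple instead of an element.
    for _event, (prefix, uri) in ET.iterparse(source, events=("start-ns",)):
        namespaces.setdefault(prefix, uri)  # keep the first declaration of each prefix
    return namespaces


namespaces = collect_namespaces(io.BytesIO(SAMPLE))
print(namespaces)
```

With a real file, the path ("./blob.xml" in code00.py) could be passed instead of the BytesIO object. The file is then read twice (once for the namespaces, once for the tree), which should be fine for a document of this size.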
Upvotes: 1