WorldTeacher

Reputation: 103

extract data from xml subfields using python

I have an XML file that I want to extract data from. I tried using Python, but the example script I found online doesn't extract the data I want. The script, from https://medium.com/analytics-vidhya/parsing-xml-files-in-python-d7c136bb9aa5, works on the example it comes with, but I can't get it to work on my XML. Here is my XML file: https://pastebin.com/Q4HTYacM (it is ~200 lines long, which is why I put it on Pastebin).

The data I'm interested in can be found in the

<datafield  tag="100" ind1="1" ind2=" "> <!--VerfasserIn-->
   <subfield code="a">Ullenboom, Christian</subfield>
   <subfield code="e">VerfasserIn</subfield>
   <subfield code="0">(DE-588)123404738</subfield>
   <subfield code="0">(DE-627)502584122</subfield>
   <subfield code="0">(DE-576)184619254</subfield>
   <subfield code="4">aut</subfield>
</datafield>

field, as well as some others. The problem is that I'm interested in <subfield code="a">Ullenboom, Christian</subfield>, but I can't extract it: it seems that root=tree.getroot() only treats the first line as searchable, and I haven't found a way to search for specific datafields.

Any help is appreciated.

Edit: My script:

## source: https://medium.com/analytics-vidhya/parsing-xml-files-in-python-d7c136bb9aa5

# import libs
import pandas as pd
import numpy as np
import glob

import xml.etree.cElementTree as et


#parse the file
tree=et.parse(glob.glob('./**/*baselinexml.xml',recursive=True)[0])

root=root.getroot()
#create list for values
creator = []
titlebook = []
VerfasserIn = []

# Converting the data
for creator in root.iter('datafield tag="100" '):
    print(creator)
print("step1")
creator.append(VerfasserIn)
for titlebook in root.iter('datafield tag="245" ind1="1" ind2="0"'):
    print(titlebook.text)


# creating dataframe

Jobs_df = pd.DataFrame(
list(zip(creator, titlebook)),
columns=['creator','titlebook'])
#saving as .csv

Jobs_df.to_csv("sample-api1.csv")

I'm fairly new to this kind of programming, so I tried modifying the code from the example.

Upvotes: 1

Views: 6230

Answers (1)

CristiFati

Reputation: 41127

See [Python.Docs]: xml.etree.ElementTree - The ElementTree XML API; you will find everything you need to know there.

The XML you posted on PasteBin is not complete (and therefore invalid). It's lacking </zs:recordData></zs:record></zs:records></zs:searchRetrieveResponse> at the end.
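
One way to confirm this kind of truncation is to let the parser report it. The sketch below uses a shortened, made-up stand-in for the PasteBin content (not the real file), and is_well_formed is a hypothetical helper name chosen for illustration.

```python
from xml.etree import ElementTree as ET

# Hypothetical, shortened stand-in for the truncated PasteBin file:
# the opening tags are present, but the closing tags are missing.
TRUNCATED = (
    '<zs:searchRetrieveResponse xmlns:zs="http://www.loc.gov/zing/srw/">'
    "<zs:records><zs:record><zs:recordData>"
    '<record xmlns="http://www.loc.gov/MARC21/slim"></record>'
)
# Same content with the missing closing tags appended.
REPAIRED = TRUNCATED + "</zs:recordData></zs:record></zs:records></zs:searchRetrieveResponse>"


def is_well_formed(xml_text):
    """Return True if xml_text parses, False if the parser rejects it."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed(TRUNCATED))
print(is_well_formed(REPAIRED))
```

If the first check fails on your own file, appending the missing closing tags (as done for REPAIRED) should make it parseable.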

The presence of namespaces complicates things, as the (real) node tags are not the literal strings from the file. Pay particular attention to the sections on namespaces and XPath in the documentation linked above.
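
To illustrate why a plain tag name finds nothing, here is a minimal sketch on a small sample that mimics the structure of the posted file (the sample is an assumption, trimmed to one datafield). It also shows one possible way to collect the namespace map with iterparse instead of typing it by hand; collect_namespaces is a hypothetical helper name.

```python
from io import StringIO
from xml.etree import ElementTree as ET

# Trimmed sample mimicking the structure of the posted file (assumption).
SAMPLE = """<zs:searchRetrieveResponse xmlns:zs="http://www.loc.gov/zing/srw/">
  <zs:records><zs:record><zs:recordData>
    <record xmlns="http://www.loc.gov/MARC21/slim">
      <datafield tag="100" ind1="1" ind2=" ">
        <subfield code="a">Ullenboom, Christian</subfield>
      </datafield>
    </record>
  </zs:recordData></zs:record></zs:records>
</zs:searchRetrieveResponse>"""

root = ET.fromstring(SAMPLE)

# The real tag includes the namespace URI in curly braces.
print(root.tag)

# Searching for the literal name from the file matches nothing ...
plain_hits = list(root.iter("datafield"))
# ... while the namespace-qualified name matches the node.
qualified_hits = list(root.iter("{http://www.loc.gov/MARC21/slim}datafield"))
print(len(plain_hits), len(qualified_hits))


def collect_namespaces(source):
    """Build a prefix -> URI mapping from the 'start-ns' parser events."""
    found = {}
    for _event, (prefix, uri) in ET.iterparse(source, events=("start-ns",)):
        found[prefix] = uri
    return found


namespaces = collect_namespaces(StringIO(SAMPLE))
print(namespaces)
```

Note that the default namespace is collected under the empty prefix "", which find/iterfind accept in the namespaces mapping only from Python 3.8 onwards.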

Here's a variant.

code00.py:

#!/usr/bin/env python

import sys
from xml.etree import ElementTree as ET


def main(*argv):
    doc = ET.parse("./blob.xml")  # Saved (and corrected) the file from PasteBin
    root = doc.getroot()
    namespaces = {  # Manually extracted from the XML file, but there could be code written to automatically do that.
        "zs": "http://www.loc.gov/zing/srw/",
        "": "http://www.loc.gov/MARC21/slim",
    }
    #print(root)
    datafield_nodes_path = "./zs:records/zs:record/zs:recordData/record/datafield"  # XPath
    datafield_attribute_filters = [
        {
            "tag": "100",
            "ind1": "1",
            "ind2": " ",
        },
        {
            "tag": "245",
            "ind1": "1",
            "ind2": "0",
        },
    ]
    #datafield_attribute_filters = []  # Uncomment this line to clear filters (and process each datafield node)
    ret = []
    for datafield_node in root.iterfind(datafield_nodes_path, namespaces=namespaces):
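        if datafield_attribute_filters:
            # Keep the node only if all attributes of at least one filter match
            skip_node = True
            for attr_dict in datafield_attribute_filters:
                for k, v in attr_dict.items():
                    if datafield_node.get(k) != v:
                        break  # Mismatch: try the next filter
                else:  # No break: every attribute of this filter matched
                    skip_node = False
                    break
            if skip_node:
                continue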
        for subfield_node in datafield_node.iterfind("./subfield[@code='a']", namespaces=namespaces):
            ret.append(subfield_node.text)
    print("Results:")
    for i, e in enumerate(ret, start=1):
        print("{:2d}: {:s}".format(i, e))


if __name__ == "__main__":
    print("Python {:s} {:03d}bit on {:s}\n".format(" ".join(elem.strip() for elem in sys.version.split("\n")),
                                                   64 if sys.maxsize > 0x100000000 else 32, sys.platform))
    rc = main(*sys.argv[1:])
    print("\nDone.")
    sys.exit(rc)

Output:

[cfati@CFATI-5510-0:e:\Work\Dev\StackOverflow\q071724477]> "e:\Work\Dev\VEnvs\py_pc064_03.09_test0\Scripts\python.exe" code00.py
Python 3.9.9 (tags/v3.9.9:ccb0e6a, Nov 15 2021, 18:08:50) [MSC v.1929 64 bit (AMD64)] 064bit on win32

Results:
 1: Ullenboom, Christian
 2: Java ist auch eine Insel

Done.

In the above example, I extracted the text of every subfield node whose code attribute is a and which is a child of a datafield node whose attributes match one of the entries in the datafield_attribute_filters list (I took the attribute filters from your script).
You can add more filtering if you need to.
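
As an alternative to filtering in Python, the attribute checks can also be expressed as XPath predicates, which ElementTree supports in a limited form. This is only a sketch under assumptions: the sample XML is a trimmed stand-in for the real file, the marc prefix is my own choice (any prefix works as long as it maps to the right URI), and subfield_texts is a hypothetical helper.

```python
from xml.etree import ElementTree as ET

# Trimmed stand-in for the real file (assumption), with the two fields of interest.
SAMPLE = """<zs:searchRetrieveResponse xmlns:zs="http://www.loc.gov/zing/srw/">
  <zs:records><zs:record><zs:recordData>
    <record xmlns="http://www.loc.gov/MARC21/slim">
      <datafield tag="100" ind1="1" ind2=" ">
        <subfield code="a">Ullenboom, Christian</subfield>
        <subfield code="e">VerfasserIn</subfield>
      </datafield>
      <datafield tag="245" ind1="1" ind2="0">
        <subfield code="a">Java ist auch eine Insel</subfield>
      </datafield>
    </record>
  </zs:recordData></zs:record></zs:records>
</zs:searchRetrieveResponse>"""

# Prefix chosen for this sketch; only the URI has to match the file.
NS = {"marc": "http://www.loc.gov/MARC21/slim"}


def subfield_texts(root, tag, code="a"):
    """Return the text of every subfield with the given code under datafields with the given tag."""
    path = ".//marc:datafield[@tag='{}']/marc:subfield[@code='{}']".format(tag, code)
    return [node.text for node in root.iterfind(path, NS)]


root = ET.fromstring(SAMPLE)
authors = subfield_texts(root, "100")
titles = subfield_texts(root, "245")
print(authors)
print(titles)
```

The two resulting lists could then be zipped into rows for a DataFrame or a CSV file, as the script in the question attempts to do.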

Upvotes: 1
