Parsing XML repeated child root in Python using BeautifulSoup

Question

I've run into an issue with the way I've been parsing an XML file, which currently looks like this:

import csv
from bs4 import BeautifulSoup

soup = BeautifulSoup(xml_string, "lxml")
pub_ref = soup.findAll("publication-reference")

# On Python 3 the csv module expects a text-mode file opened with newline=''
with open('./output.csv', 'a', newline='') as f:
    writer = csv.writer(f, dialect='excel')

    for info in pub_ref:  
        assign = soup.findAll("assignee")
        pat_cite = soup.findAll("patcit")

        # Keeps the last orgname found among the assignees
        for item1 in assign:
            if item1.find("orgname"):
                org_name = item1.find("orgname").text

        for item2 in pat_cite:
            if item2.find("name"):
                name = item2.find("name").text


        for inv_name, pat_num, cpc_num, class_num, subclass_num, date_num, country, city, state in zip(soup.findAll("invention-title"), soup.findAll("doc-number"), soup.findAll("section"), soup.findAll("class"), soup.findAll("subclass"), soup.findAll("date"), soup.findAll("country"), soup.findAll("city"), soup.findAll("state")):

            writer.writerow([inv_name.text, pat_num.text, org_name, cpc_num.text, class_num.text, subclass_num.text, date_num.text, country.text, city.text, state.text, name])

Originally I only needed a few elements (the ones written out in the writerow call at the end), but I now have about 10 more parent elements with over 30 more child elements to parse, so listing every one explicitly like this no longer scales. The data also contains repeated elements, which look like this:

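<us-citation>
<patcit>
<document-id>
<country>US</country>
<doc-number>1589850</doc-number>
<kind>A</kind>
<name>Haskell</name>
<date>19260600</date>
</document-id>
</patcit>
<category>cited by applicant</category>
</us-citation>
<us-citation>
<patcit>
<document-id>
<country>US</country>
<doc-number>D134414</doc-number>
<kind>S</kind>
<name>Orme, Jr.</name>
<date>19421100</date>
</document-id>
</patcit>
<category>cited by applicant</category>
</us-citation>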

I would like to parse repeated child elements (such as patcit) into my CSV file as columns, like so:

invention name  country   city  .... patcit name1  patcit date1....
              white space            patcit name2  patcit date2....
              white space            patcit name3  patcit date3....

And so on. Because each invention has more than one citation or reference, most of the other information should appear only once per invention, on its first row.
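
For illustration, here is a minimal sketch of the layout described above, not a definitive solution. To stay dependency-free it swaps in the standard library's xml.etree.ElementTree for BeautifulSoup; the same idea should carry over to BeautifulSoup's find and find_all. The sample XML, the patents/patent wrapper tags, the document-id nesting, the invention title and assignee values, and the names SINGLE_FIELDS, CITATION_FIELDS, patent_rows and write_csv are all assumptions made up for this example. The two citations reuse the values from the sample data. Fields are listed in dictionaries so that adding another element means adding one entry rather than another zip argument.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample: tag layout and title/assignee values are assumed.
SAMPLE = """
<patents>
  <patent>
    <invention-title>Folding chair</invention-title>
    <assignee><orgname>Acme Corp</orgname><city>Dayton</city><country>US</country></assignee>
    <patcit><document-id><name>Haskell</name><date>19260600</date></document-id></patcit>
    <patcit><document-id><name>Orme, Jr.</name><date>19421100</date></document-id></patcit>
  </patent>
</patents>
""".strip()

# Fields that occur once per patent: column name -> ElementTree path.
SINGLE_FIELDS = {
    "invention name": ".//invention-title",
    "orgname": ".//assignee/orgname",
    "city": ".//assignee/city",
    "country": ".//assignee/country",
}

# Fields read from each repeated <patcit> element.
CITATION_FIELDS = {
    "patcit name": ".//name",
    "patcit date": ".//date",
}


def patent_rows(patent):
    """Return one CSV row per citation; single fields fill only the first row."""
    single = [patent.findtext(path, default="") for path in SINGLE_FIELDS.values()]
    citations = patent.findall(".//patcit")
    if not citations:
        # No citations: still emit the patent once, with empty citation cells.
        return [single + [""] * len(CITATION_FIELDS)]
    rows = []
    for index, citation in enumerate(citations):
        left = single if index == 0 else [""] * len(single)
        right = [citation.findtext(path, default="") for path in CITATION_FIELDS.values()]
        rows.append(left + right)
    return rows


def write_csv(xml_string, out):
    """Write a header plus the rows for every <patent> element to `out`."""
    root = ET.fromstring(xml_string)
    writer = csv.writer(out, dialect="excel")
    writer.writerow(list(SINGLE_FIELDS) + list(CITATION_FIELDS))
    for patent in root.iter("patent"):
        writer.writerows(patent_rows(patent))


buffer = io.StringIO()
write_csv(SAMPLE, buffer)
print(buffer.getvalue())
```

To write to a real file instead of the in-memory buffer, pass a file opened with open('./output.csv', 'a', newline='') as the second argument.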
