Parsing XML files with repeated tags that have differing data using BeautifulSoup in Python

Question

I have been stuck on this problem for a while and have not found a solution. Here is a snippet of my Python script:

pub_ref = soup.findAll("publication-reference") 

with open('./output.csv', 'ab+') as f:
    writer = csv.writer(f, dialect = 'excel')

    for info in pub_ref:  
        pat_cite = soup.findAll("patcit")
        for item in pat_cite:
            if item.find("name"):
                name = item.find("name").text

            writer.writerow([name])

In this part of the script I want to parse the children of the citation element "patcit", which sits under the parent "publication-reference" and crops up multiple times in the XML file. It looks like this:

.
.
.
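    <us-citation>
    <patcit>
    <document-id>
    <country>US</country>
    <doc-number>1589850</doc-number>
    <kind>A</kind>
    <name>Haskell</name>
    <date>19260600</date>
    </document-id>
    </patcit>
    <category>cited by applicant</category>
    </us-citation>
    <us-citation>
    <patcit>
    <document-id>
    <country>US</country>
    <doc-number>D134414</doc-number>
    <kind>S</kind>
    <name>Orme, Jr.</name>
    <date>19421100</date>
    </document-id>
    </patcit>
    <category>cited by applicant</category>
    </us-citation>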
.
.
.

The dots indicate that the file is larger than this excerpt, which does not show the parent root "publication-reference". The problem is that my script parses only one of the many children of "patcit", the "name" element, and only once through, as you can tell. This works fine for elements that have only one entry per invention, but not for those with multiple entries.

I also want to store these in a CSV file, as you can see from the writer, so that the multiple patcit citations run down a column like so:

invention name  country   city      .... patcit name1  patcit date1....
                  white space            patcit name2  patcit date2....
                  white space            patcit name3  patcit date3....

The XML files I'm using can be found at https://bulkdata.uspto.gov/data/patent/grant/redbook/fulltext/2017/

Any help would be appreciated as I've tried multiple ways and I feel this is a beginner's problem.

Dan-Dev · Accepted Answer

First of all, I downloaded one of the zip files, "ipg170103.zip", and found that it contained multiple XML documents. So I ran (on Linux):

csplit ipg170103.xml '/xml version/' '{*}'

This splits the file into multiple single documents. Working with one of these files, "xx995", I was able to see what you are working with. Using "grep" on the file for "country", I found many instances of the word, so I guessed you wanted the "country" under "publication-reference" (if not, you will have to change the script) and likewise the invention from "invention-title". I also found multiple instances of "date" under "patcit"; not all of them had a name with them, so my script omits those. I found too many "city" elements to know which one you wanted. In any case, I could not determine exactly what you wanted, so you may well have to tweak the script for your needs.
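For readers without csplit, the split step could be approximated in Python. This is only a sketch: the helper name split_xml_documents is hypothetical, and it assumes every document in the file begins with its own "<?xml" declaration, which is the same assumption the csplit pattern makes.

```python
import re


def split_xml_documents(text):
    """Cut a string of concatenated XML documents into one string per document.

    Each piece starts at an '<?xml' declaration and runs up to the next one.
    Any text before the first declaration is dropped.
    """
    starts = [m.start() for m in re.finditer(r"<\?xml\b", text)]
    bounds = starts + [len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:])]
```

Something like split_xml_documents(open("ipg170103.xml").read()) would then return a list whose entries can each be passed to BeautifulSoup, without writing the intermediate "xx" files to disk.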

from bs4 import BeautifulSoup
import csv

# Read one of the single-document files produced by csplit.
with open("xx995", "r") as f:
    xml = f.read()
soup = BeautifulSoup(xml, 'lxml')
pat = soup.find("us-patent-grant")

country = pat.find("publication-reference").find("country").text
invention = pat.find("invention-title").text

# Collect (name, date) for every citation that has a name.
data = []
for item in pat.find_all("patcit"):
    name = item.find("name")
    if name is None:
        continue  # skip citations without a name
    date = item.find("date")
    data.append((name.text, date.text if date else None))

# newline='' stops the csv module from writing extra blank lines on Windows.
with open('./output.csv', 'w', newline='') as f:
    writer = csv.writer(f, dialect='excel')
    writer.writerow(('invention', 'country', 'patcit name', 'patcit date'))
    for name, date in data:
        writer.writerow((invention, country, name, date))
        # Show the invention and country on the first row only.
        invention = None
        country = None

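This writes output.csv with the header row invention, country, patcit name, patcit date, followed by one row per named citation. The invention title and country are filled in on the first data row only and left blank on the rest.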
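The script above handles a single document. A minimal sketch of how it might be extended to several patents at once follows; the important change from the question's loop is that citations are searched from each grant element (grant.find_all) rather than from the whole soup, so each patent gets only its own citations. The sample XML, its titles, the third citation, and the helper name citation_rows are made up for illustration, and the built-in "html.parser" is used here only to avoid depending on lxml.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical, heavily trimmed sample: two grants with their own citations.
SAMPLE = """
<us-patent-grant>
  <publication-reference><document-id><country>US</country></document-id></publication-reference>
  <invention-title>Widget A</invention-title>
  <patcit><document-id><name>Haskell</name><date>19260600</date></document-id></patcit>
  <patcit><document-id><name>Orme, Jr.</name><date>19421100</date></document-id></patcit>
  <patcit><document-id><date>19500100</date></document-id></patcit>
</us-patent-grant>
<us-patent-grant>
  <publication-reference><document-id><country>US</country></document-id></publication-reference>
  <invention-title>Widget B</invention-title>
  <patcit><document-id><name>Smith</name><date>20010100</date></document-id></patcit>
</us-patent-grant>
"""


def citation_rows(grant):
    """Build CSV rows for one <us-patent-grant> element."""
    title = grant.find("invention-title").get_text(strip=True)
    country = grant.find("publication-reference").find("country").get_text(strip=True)
    rows = []
    # Search inside this grant only, not the whole soup.
    for cite in grant.find_all("patcit"):
        name = cite.find("name")
        if name is None:
            continue  # skip citations without a name, as the answer does
        date = cite.find("date")
        first = not rows  # title and country go on the first row only
        rows.append([
            title if first else "",
            country if first else "",
            name.get_text(strip=True),
            date.get_text(strip=True) if date else "",
        ])
    return rows


soup = BeautifulSoup(SAMPLE, "html.parser")
all_rows = [row
            for grant in soup.find_all("us-patent-grant")
            for row in citation_rows(grant)]

# Written to an in-memory buffer here; a real script would open a file
# with newline='' instead.
buffer = io.StringIO()
writer = csv.writer(buffer, dialect="excel")
writer.writerow(["invention", "country", "patcit name", "patcit date"])
writer.writerows(all_rows)
```

Leaving the title and country blank after the first row mirrors the layout requested in the question. Repeating them on every row would make the file easier to filter or sort later, so that choice is worth revisiting for real data.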
