HelloToEarth

Reputation: 2127

Parsing XML files with repeated tags that have differing data using BeautifulSoup in Python

I have been stuck on this problem for a while now without finding a solution. Here is a snippet of my Python script:

pub_ref = soup.findAll("publication-reference") 

with open('./output.csv', 'ab+') as f:
    writer = csv.writer(f, dialect = 'excel')

    for info in pub_ref:  
        pat_cite = soup.findAll("patcit")
        for item in pat_cite:
            if item.find("name"):
                name = item.find("name").text

            writer.writerow([name])

This part of the script is meant to parse the children of a citation element "patcit", under the parent "publication-reference". The "patcit" element appears multiple times in the XML file and looks like this:

.
.
.
    <us-references-cited>
    <us-citation>
    <patcit num="00001">
    <document-id>
    <country>US</country>
    <doc-number>1589850</doc-number>
    <kind>A</kind>
    <name>Haskell</name>
    <date>19260600</date>
    </document-id>
    </patcit>
    <category>cited by applicant</category>
    </us-citation>
    <us-citation>
    <patcit num="00002">
    <document-id>
    <country>US</country>
    <doc-number>D134414</doc-number>
    <kind>S</kind>
    <name>Orme, Jr.</name>
    <date>19421100</date>
    </document-id>
    </patcit>
    <category>cited by applicant</category>
    </us-citation>
    <us-citation>
.
.
.

The dots indicate that the file is larger than this; the excerpt does not show the parent element "publication-reference". The problem is that my script only parses one of the many children of patcit, the "name" element, and only once through, as you can tell. This works fine for elements that have only one entry per invention, but not for those with multiple entries.

I also want to store these in a CSV file, as you can see with the writer, with the multiple patcit citations listed down a column like so:

invention name  country   city      .... patcit name1  patcit date1....
                  white space            patcit name2  patcit date2....
                  white space            patcit name3  patcit date3....

The XML files I'm using can be found at https://bulkdata.uspto.gov/data/patent/grant/redbook/fulltext/2017/

Any help would be appreciated, as I've tried multiple ways and I feel this is a beginner's problem.

Upvotes: 0

Views: 347

Answers (1)

Dan-Dev

Reputation: 9440

First of all, I downloaded one of the zip files, "ipg170103.zip", and found that it contained multiple XML documents. So I ran (on Linux)

csplit ipg170103.xml '/xml version/' '{*}'

to split the file into multiple single documents. Working with one of these files, "xx995", I could see what you are working with. Running "grep" on the file for "country" turned up many instances of the word, so I guessed you wanted the "country" under "publication-reference" (if not, you will have to change the script), and likewise "invention" from "invention-title". I also found multiple instances of "date" under "patcit"; not all of them had a name with them, so my script omits those. I found too many "city" elements to know which one you wanted. In any case, I could not determine exactly what you wanted, so you may well have to tweak the script for your exact needs.
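If csplit is not available (for example on Windows), the splitting step can be approximated in Python. This is only a minimal sketch: split_documents is a hypothetical helper, it assumes every document starts with a line containing "<?xml version", and the two-document sample string is made up for illustration.

```python
def split_documents(text, marker="<?xml version"):
    """Split concatenated XML text into one string per document.

    Assumes each document begins with a line containing `marker`.
    """
    docs = []
    current = []
    for line in text.splitlines(keepends=True):
        # A new XML declaration closes the previous document, if any.
        if marker in line and current:
            docs.append("".join(current))
            current = []
        current.append(line)
    if current:
        docs.append("".join(current))
    return docs


# Made-up sample: two tiny documents concatenated in one string.
sample = '<?xml version="1.0"?>\n<a/>\n<?xml version="1.0"?>\n<b/>\n'
documents = split_documents(sample)
print(len(documents))
```

Each returned string could then be passed to BeautifulSoup on its own instead of being written out to separate files. The script for a single split-out document follows.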

from bs4 import BeautifulSoup
import csv

# Read one of the single-document files produced by csplit
with open("xx995", 'r') as xml_file:
    xml = xml_file.read()
soup = BeautifulSoup(xml, 'lxml')
pat = soup.find("us-patent-grant")

# Values that occur once per invention
country = pat.find("publication-reference").find("country").text
invention = pat.find("invention-title").text

# Collect (name, date) for every citation that has a name
data = []
pat_cite = pat.findAll("patcit")
for item in pat_cite:
    name = None
    date = None
    if item.find("name"):
        name = item.find("name").text
        # Only get date if name
        if item.find("date"):
            date = item.find("date").text
        data.append((name,date))

# newline='' stops the csv module from writing blank lines on Windows
with open('./output.csv', 'wt', newline='') as f:
    writer = csv.writer(f, dialect='excel')
    writer.writerow(('invention', 'country', 'patcit name', 'patcit date'))
    for d in data:
        writer.writerow((invention, country, d[0], d[1]))
        # Leave the per-invention columns blank after the first row
        invention = None
        country = None

Outputs:

[image: screenshot of the resulting output.csv]
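The reason the loop in the question repeats itself is that it calls soup.findAll("patcit") on the whole document inside the outer loop, rather than searching inside the element currently being processed. The sketch below shows the scoped version of the idea. Note the assumptions: it uses the standard-library xml.etree.ElementTree rather than BeautifulSoup (a swap-in so it runs without extra packages), citation_rows is a hypothetical helper, and SAMPLE is a simplified, made-up document built from the excerpt in the question (the title text is invented). It has not been run against the real USPTO files.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Simplified, made-up sample based on the excerpt in the question.
SAMPLE = """<us-patent-grant>
  <publication-reference>
    <document-id><country>US</country></document-id>
  </publication-reference>
  <invention-title>Example title</invention-title>
  <us-references-cited>
    <us-citation>
      <patcit num="00001"><document-id>
        <name>Haskell</name><date>19260600</date>
      </document-id></patcit>
    </us-citation>
    <us-citation>
      <patcit num="00002"><document-id>
        <name>Orme, Jr.</name><date>19421100</date>
      </document-id></patcit>
    </us-citation>
  </us-references-cited>
</us-patent-grant>"""


def citation_rows(grant):
    """Return CSV rows for one <us-patent-grant> element."""
    invention = grant.findtext(".//invention-title", default="")
    pub_ref = grant.find(".//publication-reference")
    country = pub_ref.findtext(".//country", default="") if pub_ref is not None else ""
    rows = []
    # Search inside this grant only, never the whole document.
    for patcit in grant.iter("patcit"):
        name = patcit.findtext(".//name")
        if name is None:
            continue  # skip citations that have no name
        date = patcit.findtext(".//date", default="")
        rows.append([invention, country, name, date])
        # Blank the per-invention columns after the first citation row.
        invention = country = ""
    return rows


grant = ET.fromstring(SAMPLE)
rows = citation_rows(grant)

# Write to an in-memory buffer here; a real script would open a file instead.
buffer = io.StringIO()
writer = csv.writer(buffer, dialect="excel")
writer.writerow(["invention", "country", "patcit name", "patcit date"])
writer.writerows(rows)
print(buffer.getvalue())
```

With BeautifulSoup the equivalent change is to call findAll("patcit") on each grant element (for example info.findAll(...) inside the loop) instead of on soup.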

Upvotes: 1
