Reputation: 2127
I have been stuck on this problem for a while now without finding a solution. Here is a snippet of my Python script:
pub_ref = soup.findAll("publication-reference")
with open('./output.csv', 'ab+') as f:
    writer = csv.writer(f, dialect = 'excel')
    for info in pub_ref:
        pat_cite = soup.findAll("patcit")
        for item in pat_cite:
            if item.find("name"):
                name = item.find("name").text
                writer.writerow([name])
In this part of the script I want to parse the children of "patcit", a citation element under the parent "publication-reference" that crops up multiple times in the XML file and looks like this:
.
.
.
<us-references-cited>
  <us-citation>
    <patcit num="00001">
      <document-id>
        <country>US</country>
        <doc-number>1589850</doc-number>
        <kind>A</kind>
        <name>Haskell</name>
        <date>19260600</date>
      </document-id>
    </patcit>
    <category>cited by applicant</category>
  </us-citation>
  <us-citation>
    <patcit num="00002">
      <document-id>
        <country>US</country>
        <doc-number>D134414</doc-number>
        <kind>S</kind>
        <name>Orme, Jr.</name>
        <date>19421100</date>
      </document-id>
    </patcit>
    <category>cited by applicant</category>
  </us-citation>
  <us-citation>
.
.
.
The dots indicate that the file is larger than this excerpt, which doesn't show the parent "publication-reference". The problem is that my script only parses one of the many children of patcit, the "name" element, and only once through, as you can tell. This works fine for elements that have only one entry per invention, but not for those with multiple entries.
I also want to store these in a CSV file, as you can see from the writer, with the multiple patcit citations running down a column like so:
invention name   country   city   ....   patcit name1   patcit date1 ....
(white space)                            patcit name2   patcit date2 ....
(white space)                            patcit name3   patcit date3 ....
The XML files I'm using can be found at https://bulkdata.uspto.gov/data/patent/grant/redbook/fulltext/2017/
Any help would be appreciated as I've tried multiple ways and I feel this is a beginner's problem.
Upvotes: 0
Views: 347
Reputation: 9440
First of all, I downloaded one of the zip files, "ipg170103.zip", and found that it contained multiple XML documents, so I ran (on Linux):
csplit ipg170103.xml '/xml version/' '{*}'
This splits the file into multiple single documents. Working with one of these files, "xx995", I could see what you are working with. Running "grep" on the file for "country" turned up many instances of the word, so I guessed you wanted the "country" under "publication-reference" (if not, you will have to change the script), and likewise the invention name from "invention-title". I also found multiple instances of "date" under "patcit"; not all of them had a name with them, so my script omits those. There were too many "city" elements for me to know which one you wanted. In any case, I could not determine exactly what you wanted, so you may well have to tweak the script a bit for your exact needs.
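If csplit is not available, the split step could probably be done in Python instead. The following is only a rough sketch: `split_documents` and `iter_bulk_file` are names made up for this example, and it assumes (as the csplit pattern does) that every document starts on a line containing `<?xml version`. It reads the bulk file line by line, so the whole file never has to sit in memory at once.

```python
def split_documents(lines, marker="<?xml version"):
    """Yield one string per XML document from an iterable of lines.

    A new document starts at every line containing `marker`
    (hypothetical helper, mirrors the csplit pattern above).
    """
    doc = []
    for line in lines:
        # A marker line closes the document collected so far, if any
        if marker in line and doc:
            yield "".join(doc)
            doc = []
        doc.append(line)
    # Emit whatever is left after the last marker
    if doc:
        yield "".join(doc)


def iter_bulk_file(path, marker="<?xml version"):
    """Open a concatenated bulk file and yield its documents one at a time.

    The utf-8 encoding is an assumption about the bulk files.
    """
    with open(path, "r", encoding="utf-8") as handle:
        yield from split_documents(handle, marker)
```

Each string yielded by `iter_bulk_file("ipg170103.xml")` could then be handed to `BeautifulSoup(doc, 'lxml')` in place of the contents of a single split file. The script below works on one such file, "xx995".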
from bs4 import BeautifulSoup
import csv

# Read one of the single documents produced by csplit
with open("xx995", 'r') as xml_file:
    xml = xml_file.read()
soup = BeautifulSoup(xml, 'lxml')
pat = soup.find("us-patent-grant")

country = pat.find("publication-reference").find("country").text
invention = pat.find("invention-title").text

data = []
pat_cite = pat.findAll("patcit")
for item in pat_cite:
    name = None
    date = None
    if item.find("name"):
        name = item.find("name").text
        # Only get date if name
        if item.find("date"):
            date = item.find("date").text
        data.append((name, date))

# newline='' (Python 3) stops the csv module from adding blank lines between rows
with open('./output.csv', 'wt', newline='') as f:
    writer = csv.writer(f, dialect='excel')
    writer.writerow(('invention', 'country', 'patcit name', 'patcit date'))
    for d in data:
        writer.writerow((invention, country, d[0], d[1]))
        # Leave the invention and country cells blank after the first row
        invention = None
        country = None
Outputs:
Upvotes: 1