Reputation: 2127
So I've run into an issue where I've been parsing an XML file like so:
import csv
from bs4 import BeautifulSoup

soup = BeautifulSoup(xml_string, "lxml")
pub_ref = soup.findAll("publication-reference")
with open('./output.csv', 'ab+') as f:
    writer = csv.writer(f, dialect='excel')
    for info in pub_ref:
        assign = soup.findAll("assignee")
        pat_cite = soup.findAll("patcit")
        for item1 in assign:
            if item1.find("orgname"):
                org_name = item1.find("orgname").text
        for item2 in pat_cite:
            if item2.find("name"):
                name = item2.find("name").text
        for inv_name, pat_num, cpc_num, class_num, subclass_num, date_num, country, city, state in zip(soup.findAll("invention-title"), soup.findAll("doc-number"), soup.findAll("section"), soup.findAll("class"), soup.findAll("subclass"), soup.findAll("date"), soup.findAll("country"), soup.findAll("city"), soup.findAll("state")):
            writer.writerow([inv_name.text, pat_num.text, org_name, cpc_num.text, class_num.text, subclass_num.text, date_num.text, country.text, city.text, state.text, name])
That was fine while I only needed a few elements (the ones written out in the final row above), but I now have about 10 more parent elements with over 30 more child elements to parse, so listing each one explicitly like this no longer scales. The data also contains repeated elements, like this:
<us-references-cited>
  <us-citation>
    <patcit num="00001">
      <document-id>
        <country>US</country>
        <doc-number>1589850</doc-number>
        <kind>A</kind>
        <name>Haskell</name>
        <date>19260600</date>
      </document-id>
    </patcit>
    <category>cited by applicant</category>
  </us-citation>
  <us-citation>
    <patcit num="00002">
      <document-id>
        <country>US</country>
        <doc-number>D134414</doc-number>
        <kind>S</kind>
        <name>Orme, Jr.</name>
        <date>19421100</date>
      </document-id>
    </patcit>
    <category>cited by applicant</category>
  </us-citation>
  <us-citation>
I would like to write repeated child elements (such as patcit) to my CSV file as columns, like so:
invention name   country   city   ....   patcit name1   patcit date1   ....
             (white space)               patcit name2   patcit date2   ....
             (white space)               patcit name3   patcit date3   ....
And so on. Because each invention has more than one citation or reference, most of the other information should appear only once per invention, on its first row.
Upvotes: 0
Views: 759
Reputation: 22440
Try the script below. I believe this is what you are after.
from bs4 import BeautifulSoup
xml_content='''
<us-references-cited>
<us-citation>
<patcit num="00001">
<document-id>
<country>US</country>
<doc-number>1589850</doc-number>
<kind>A</kind>
<name>Haskell</name>
<date>19260600</date>
</document-id>
</patcit>
<category>cited by applicant</category>
</us-citation>
<us-citation>
<patcit num="00002">
<document-id>
<country>US</country>
<doc-number>D134414</doc-number>
<kind>S</kind>
<name>Orme, Jr.</name>
<date>19421100</date>
</document-id>
</patcit>
<category>cited by applicant</category>
</us-citation>
<us-citation>
'''
soup = BeautifulSoup(xml_content, "lxml")
# Quote the attribute value: newer selector engines reject an unquoted value that starts with a digit.
for item in soup.select('patcit[num^="000"]'):
    name = item.select("name")[0].text
    date = item.select("date")[0].text
    kind = item.select("kind")[0].text
    doc_number = item.select("doc-number")[0].text
    country = item.select("country")[0].text
    print(name, date, kind, doc_number, country)
Results:
Haskell 19260600 A 1589850 US
Orme, Jr. 19421100 S D134414 US
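If you also need the CSV layout from the question (invention-level fields on the first row only, one row per citation) without naming every child element, something like the following might work. It is only a sketch under a few assumptions: it uses the standard library's xml.etree.ElementTree instead of BeautifulSoup, so the XML must be well-formed (the sample here is a closed version of your snippet); leaf_fields, citation_rows and write_csv are names I made up; and "invention name" with its value is a placeholder, since the snippet contains no title element.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Well-formed sample built from the citation data in the question.
SAMPLE = """\
<us-references-cited>
  <us-citation>
    <patcit num="00001">
      <document-id>
        <country>US</country>
        <doc-number>1589850</doc-number>
        <kind>A</kind>
        <name>Haskell</name>
        <date>19260600</date>
      </document-id>
    </patcit>
    <category>cited by applicant</category>
  </us-citation>
  <us-citation>
    <patcit num="00002">
      <document-id>
        <country>US</country>
        <doc-number>D134414</doc-number>
        <kind>S</kind>
        <name>Orme, Jr.</name>
        <date>19421100</date>
      </document-id>
    </patcit>
    <category>cited by applicant</category>
  </us-citation>
</us-references-cited>
"""


def leaf_fields(element):
    """Map each leaf descendant's tag to its text, without listing tags up front."""
    return {
        child.tag: (child.text or "").strip()
        for child in element.iter()
        if len(child) == 0 and child is not element
    }


def citation_rows(root, shared):
    """Build one row per <patcit>; shared values are filled on the first row only."""
    rows = []
    for index, patcit in enumerate(root.iter("patcit")):
        row = dict(shared) if index == 0 else {key: "" for key in shared}
        for tag, text in leaf_fields(patcit).items():
            row["patcit " + tag] = text
        rows.append(row)
    return rows


def write_csv(rows, handle):
    """Write rows using the union of all keys, in first-seen order, as the header."""
    header = []
    for row in rows:
        for key in row:
            if key not in header:
                header.append(key)
    writer = csv.DictWriter(handle, fieldnames=header, restval="")
    writer.writeheader()
    writer.writerows(rows)


root = ET.fromstring(SAMPLE)
# "invention name" / "Example title" are hypothetical placeholders.
rows = citation_rows(root, {"invention name": "Example title"})
buffer = io.StringIO()
write_csv(rows, buffer)
print(buffer.getvalue())
```

Note that leaf_fields keeps only the last value when the same tag appears twice inside one patcit, so you would need to adjust it if that happens in your real files. To write to disk, pass a file opened with open(path, "w", newline="") in place of the StringIO buffer.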
This solution is for the link you provided later:
import requests
from bs4 import BeautifulSoup

res = requests.get("https://bulkdata.uspto.gov/data/patent/grant/redbook/fulltext/2017/")
soup = BeautifulSoup(res.text, "lxml")
# Take the second table on the page.
table = soup.select("table")[1]
for items in table.select("tr"):
    # Join the cell texts of each row into one line.
    data = ' '.join([item.text for item in items.select("td")])
    print(data)
Upvotes: 1