Reputation: 191
Edit: resolved. Thought I'd add my answer at the bottom...
Note: the desired output is a bunch of lines like
US D0591026
I have data that looks like the following in XML:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v42-2006-08-23.dtd" [ ]>
<us-patent-grant lang="EN" dtd-version="v4.2 2006-08-23" file="USD0591026-20090428.XML" status="PRODUCTION" id="us-patent-grant" country="US" date-produced="20090414" date-publ="20090428">
<us-bibliographic-data-grant>
<publication-reference>
<document-id>
<country>US</country>
<doc-number>D0591026</doc-number>
<kind>S1</kind>
<date>20090428</date>
</document-id>
</publication-reference>
<application-reference appl-type="design">
<document-id>
<country>US</country>
<doc-number>29303426</doc-number>
<date>20080208</date>
</document-id>
</application-reference>
<us-application-series-code>29</us-application-series-code>
<priority-claims>
<priority-claim sequence="01" kind="national">
<country>CA</country>
<doc-number>122078</doc-number>
<date>20070830</date>
</priority-claim>
</priority-claims>
<us-term-of-grant>
<length-of-grant>14</length-of-grant>
</us-term-of-grant>
<classification-locarno>
<edition>9</edition>
<main-classification>0101</main-classification>
</classification-locarno>
<classification-national>
<country>US</country>
<main-classification>D 1106</main-classification>
</classification-national>
<invention-title id="d0e71">Edible fruit product in the shape of a rocketship</invention-title>
<references-cited>
I am trying to pull out the country and the document number. I've gotten to this point:
import os
import io
from bs4 import BeautifulSoup
import csv
import requests

directory_in_str = 'C:/Users/somedirectory'
directory = os.fsencode(directory_in_str)

for file in os.listdir(directory):
    filename = os.fsdecode(file)
    full_name = directory_in_str + filename
    handler = open(full_name).read()
    soup = BeautifulSoup(handler, 'lxml')
    patents = soup.find_all('us-patent-grant')
    pub_ref = soup.find_all('publication-reference')
    country = soup.find_all('country')
    doc_num = soup.find_all('doc-number')
    for patent in pub_ref:
        for doc_num in patent:
            print(doc_num)
    continue
This prints a nice block that includes those elements, but everything I have tried to get at those two specific elements (and then concatenate them) has failed. I've been able to do it with string operations, but the dataset isn't formatted consistently enough (I will be pulling out text fields without a standard length later) for me to feel confident basing the whole analysis on slicing strings.
Any ideas how I can drill down into those nested tags and return just those two elements?
Ok, so I have made some changes, and gotten my code to:
import os
import io
from bs4 import BeautifulSoup
import csv
import requests

directory_in_str = 'C:/somedir'
directory = os.fsencode(directory_in_str)

for file in os.listdir(directory):
    filename = os.fsdecode(file)
    full_name = directory_in_str + filename
    handler = open(full_name).read()
    soup = BeautifulSoup(handler, 'lxml')
    patents = soup.find_all('us-patent-grant')
    pub_ref = soup.find_all('publication-reference')
    for patent in pub_ref:
        country = patent.find_all('country')
        doc_num = patent.find_all('doc-number')
        print(country + doc_num)
    continue
Which gives me most of what I want. I am getting this:
[<country>US</country>, <doc-number>D0591026</doc-number>]
but what I want is just:
US D0591026
I understand the object is a bs4 result set, but I am not familiar enough with it to know how to return only the text inside the tags. Eventually this is going to a CSV, so I don't want the tags in there.
I converted the soup objects to strings and used regular expressions to get the desired output:
...
import re
...
...
country = patent.find_all('country')
doc_num = patent.find_all('doc-number')
country_str = str(country)
doc_num_str = str(doc_num)
country_str2 = re.search('>(.*)<', country_str)
doc_num_str2 = re.search('>(.*)<', doc_num_str)
print(country_str2.group(1) + ' ' + doc_num_str2.group(1))
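For comparison, here is a minimal sketch that gets the same text without converting to strings or using regex. It calls find() on each publication-reference and reads the text with get_text(). The names SAMPLE and publication_ids are made up for illustration, the sample is a trimmed copy of the XML above, and 'html.parser' is used only so the sketch runs without lxml installed (the code above uses 'lxml').

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample data from the question (illustrative only).
SAMPLE = """
<us-patent-grant>
<us-bibliographic-data-grant>
<publication-reference>
<document-id>
<country>US</country>
<doc-number>D0591026</doc-number>
<kind>S1</kind>
<date>20090428</date>
</document-id>
</publication-reference>
<application-reference appl-type="design">
<document-id>
<country>US</country>
<doc-number>29303426</doc-number>
<date>20080208</date>
</document-id>
</application-reference>
</us-bibliographic-data-grant>
</us-patent-grant>
"""


def publication_ids(markup):
    """Return 'COUNTRY DOCNUMBER' strings, one per publication-reference."""
    # Assumption: html.parser is enough for this flat sample; swap in 'lxml'
    # to match the original code.
    soup = BeautifulSoup(markup, "html.parser")
    results = []
    for pub_ref in soup.find_all("publication-reference"):
        # find() returns the first matching Tag (or None), not a ResultSet.
        country = pub_ref.find("country")
        doc_num = pub_ref.find("doc-number")
        if country is None or doc_num is None:
            continue  # skip incomplete records instead of raising
        results.append(
            f"{country.get_text(strip=True)} {doc_num.get_text(strip=True)}"
        )
    return results


for line in publication_ids(SAMPLE):
    print(line)
```

Because the search starts from each publication-reference, the country and doc-number under application-reference are not picked up.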
Upvotes: 2
Views: 5945
Reputation: 3859
Try this:
doc_nums = soup.find_all('doc-number')
for num in doc_nums:
    print(num.text)
Upvotes: 1
Reputation: 6518
To get a list pairing each doc-number with its related country, using a list comprehension and zip, a simple one-liner would be:
>>> [(country.text,number.text) for country, number in zip(soup.findAll("country"), soup.findAll("doc-number"))]
[('US', 'D0591026'), ('US', '29303426'), ('CA', '122078')]
Or perhaps a more readable way if you are not used to list comprehensions:
>>> lst = []
>>> for country, number in zip(soup.findAll("country"), soup.findAll("doc-number")):
...     print(country.text, number.text)
...     lst.append((country.text, number.text))
...
US D0591026
US 29303426
CA 122078
>>> lst
[('US', 'D0591026'), ('US', '29303426'), ('CA', '122078')]
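One caveat worth noting: zip pairs the two lists purely by position, and the sample also has a country under classification-national with no doc-number next to it. That extra country happens to come last here, so zip simply drops it, but a differently ordered file could misalign the pairs. Below is a rough sketch of one way around that: look up the country inside each doc-number's own parent, then write the rows out as CSV, since that is where the question says the data is headed. The names SAMPLE, id_rows and rows_to_csv are hypothetical, and 'html.parser' is an assumption made so the sketch does not need lxml.

```python
import csv
import io

from bs4 import BeautifulSoup

# Trimmed copy of the sample data from the question (illustrative only).
SAMPLE = """
<us-bibliographic-data-grant>
<publication-reference>
<document-id>
<country>US</country>
<doc-number>D0591026</doc-number>
</document-id>
</publication-reference>
<application-reference appl-type="design">
<document-id>
<country>US</country>
<doc-number>29303426</doc-number>
</document-id>
</application-reference>
<priority-claims>
<priority-claim sequence="01" kind="national">
<country>CA</country>
<doc-number>122078</doc-number>
</priority-claim>
</priority-claims>
<classification-national>
<country>US</country>
<main-classification>D 1106</main-classification>
</classification-national>
</us-bibliographic-data-grant>
"""


def id_rows(markup):
    """Pair each doc-number with the country found in the same parent tag."""
    soup = BeautifulSoup(markup, "html.parser")
    rows = []
    for doc_num in soup.find_all("doc-number"):
        # Search only inside the enclosing element (document-id or
        # priority-claim), so unrelated <country> tags are never paired.
        country = doc_num.parent.find("country")
        if country is not None:
            rows.append(
                (country.get_text(strip=True), doc_num.get_text(strip=True))
            )
    return rows


def rows_to_csv(rows):
    """Render the rows as CSV text; the header names are illustrative."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["country", "doc_number"])
    writer.writerows(rows)
    return buffer.getvalue()


print(rows_to_csv(id_rows(SAMPLE)))
```

To write to a real file instead of a string, the same csv.writer calls work on a file opened with open(path, "w", newline="").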
Upvotes: 2