Reputation: 191

Parsing XML with Beautiful Soup

Edit: resolved. Thought I'd add my answer at the bottom...

Note: the desired output is a bunch of lines like

US D0591026

I have data that looks like the following in XML:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v42-2006-08-23.dtd" [ ]>
<us-patent-grant lang="EN" dtd-version="v4.2 2006-08-23" file="USD0591026-20090428.XML" status="PRODUCTION" id="us-patent-grant" country="US" date-produced="20090414" date-publ="20090428">
<us-bibliographic-data-grant>
<publication-reference>
<document-id>
<country>US</country>
<doc-number>D0591026</doc-number>
<kind>S1</kind>
<date>20090428</date>
</document-id>
</publication-reference>
<application-reference appl-type="design">
<document-id>
<country>US</country>
<doc-number>29303426</doc-number>
<date>20080208</date>
</document-id>
</application-reference>
<us-application-series-code>29</us-application-series-code>
<priority-claims>
<priority-claim sequence="01" kind="national">
<country>CA</country>
<doc-number>122078</doc-number>
<date>20070830</date>
</priority-claim>
</priority-claims>
<us-term-of-grant>
<length-of-grant>14</length-of-grant>
</us-term-of-grant>
<classification-locarno>
<edition>9</edition>
<main-classification>0101</main-classification>
</classification-locarno>
<classification-national>
<country>US</country>
<main-classification>D 1106</main-classification>
</classification-national>
<invention-title id="d0e71">Edible fruit product in the shape of a rocketship</invention-title>
<references-cited>

I am trying to pull out the country and the document number. I've gotten this far:

import os
import io
from bs4 import BeautifulSoup
import csv
import requests

directory_in_str = 'C:/Users/somedirectory'
directory = os.fsencode(directory_in_str)

for file in os.listdir(directory):
    filename = os.fsdecode(file)
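    # os.path.join supplies the path separator that plain concatenation omits
    full_name = os.path.join(directory_in_str, filename)
    with open(full_name) as f:
        handler = f.read()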
    soup = BeautifulSoup(handler, 'lxml')
    patents=soup.find_all('us-patent-grant')
    pub_ref=soup.find_all('publication-reference')
    country=soup.find_all('country')
    doc_num=soup.find_all('doc-number')
    for patent in pub_ref:
        # iterating over a tag yields its direct children
        for child in patent:
            print(child)

This prints a nice block that includes those elements, but everything I have tried to get at those two specific elements (and then concatenate them) has failed. I've been able to do it with string operations, but the dataset isn't formatted consistently enough (later I will be pulling out text fields without a standard length) for me to feel confident performing the whole analysis by slicing strings.

Any ideas how I can drill down into those nested tags and return just those two elements?

Ok, so I have made some changes, and gotten my code to:

import os
import io
from bs4 import BeautifulSoup
import csv
import requests

directory_in_str = 'C:/somedir'

directory = os.fsencode(directory_in_str)

for file in os.listdir(directory):
    filename = os.fsdecode(file)
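    # os.path.join supplies the path separator that plain concatenation omits
    full_name = os.path.join(directory_in_str, filename)
    with open(full_name) as f:
        handler = f.read()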
    soup = BeautifulSoup(handler, 'lxml')
    patents=soup.find_all('us-patent-grant')
    pub_ref=soup.find_all('publication-reference')
    for patent in pub_ref:
        country = patent.find_all('country')
        doc_num = patent.find_all('doc-number')
        print(country + doc_num)

Which gives me most of what I want. I am getting this:

[<country>US</country>, <doc-number>D0591026</doc-number>]

but what I want is just:

US D0591026

I understand the object is a bs4 ResultSet, but I don't know how to return only the text inside each tag. Eventually this is going to a CSV, so I don't want the tags in there.

Resolved: I converted the soup objects to strings and used regular expressions to get the desired output:

...
import re
...
...
        country = patent.find_all('country')
        doc_num = patent.find_all('doc-number')
        country_str = str(country)
        doc_num_str = str(doc_num)
        # capture whatever sits between the first '>' and the last '<'
        country_str2 = re.search('>(.*)<', country_str)
        doc_num_str2 = re.search('>(.*)<', doc_num_str)
        # join with a space to match the desired "US D0591026" output
        print(country_str2.group(1) + ' ' + doc_num_str2.group(1))

Upvotes: 2

Answers (2)

gipsy

Reputation: 3859

Try this:

doc_nums=soup.find_all('doc-number')
for num in doc_nums:
  print(num.text)
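
Building on .text: the loop above prints every doc-number in the file, while the question only asks for the publication number together with its country. A minimal sketch that scopes the lookup to publication-reference and joins the two values might look like the following. The SAMPLE string and the publication_ids helper are made up for illustration, and html.parser is used only so the snippet runs without lxml installed; the 'lxml' parser from the question should behave the same way for these tags.

```python
from bs4 import BeautifulSoup

# Trimmed-down copy of the XML from the question (illustrative only).
SAMPLE = """
<us-patent-grant>
<publication-reference>
<document-id>
<country>US</country>
<doc-number>D0591026</doc-number>
</document-id>
</publication-reference>
<application-reference appl-type="design">
<document-id>
<country>US</country>
<doc-number>29303426</doc-number>
</document-id>
</application-reference>
</us-patent-grant>
"""


def publication_ids(markup):
    """Return one 'COUNTRY NUMBER' string per publication-reference."""
    soup = BeautifulSoup(markup, "html.parser")
    results = []
    for pub_ref in soup.find_all("publication-reference"):
        # find() returns the first matching tag (or None), not a ResultSet
        country = pub_ref.find("country")
        doc_num = pub_ref.find("doc-number")
        if country is None or doc_num is None:
            continue  # skip incomplete entries instead of crashing
        # get_text() returns only the text inside the tag, without markup
        results.append(
            f"{country.get_text(strip=True)} {doc_num.get_text(strip=True)}"
        )
    return results


for line in publication_ids(SAMPLE):
    print(line)
```

The application-reference entry is ignored here because the search starts from each publication-reference tag rather than from the whole soup.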

Upvotes: 1

Vinícius Figueiredo

Reputation: 6518

To get a list pairing each doc-number with its related country, a simple one-liner using a list comprehension and zip would be:

>>> [(country.text, number.text) for country, number in zip(soup.find_all("country"), soup.find_all("doc-number"))]
[('US', 'D0591026'), ('US', '29303426'), ('CA', '122078')]

Or perhaps a more readable way if you are not used to list comprehensions:

>>> lst = []
>>> for country, number in zip(soup.find_all("country"), soup.find_all("doc-number")):
    print(country.text, number.text)
    lst.append((country.text, number.text))


US D0591026
US 29303426
CA 122078
>>> lst
[('US', 'D0591026'), ('US', '29303426'), ('CA', '122078')]
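
One caveat with zip: it pairs the two lists purely by position, so it relies on every country having exactly one doc-number, in the same order. As a hedged sketch, the pairing can instead be done per parent element, with the rows written out through the csv module, since the question mentions a CSV as the end goal. SAMPLE, country_number_pairs and to_csv_text are hypothetical names, and html.parser is used so the snippet needs nothing beyond bs4.

```python
import csv
import io

from bs4 import BeautifulSoup

# Trimmed-down copy of the XML from the question (illustrative only).
SAMPLE = """
<us-bibliographic-data-grant>
<publication-reference>
<document-id>
<country>US</country>
<doc-number>D0591026</doc-number>
<kind>S1</kind>
</document-id>
</publication-reference>
<application-reference appl-type="design">
<document-id>
<country>US</country>
<doc-number>29303426</doc-number>
</document-id>
</application-reference>
<priority-claims>
<priority-claim sequence="01" kind="national">
<country>CA</country>
<doc-number>122078</doc-number>
</priority-claim>
</priority-claims>
</us-bibliographic-data-grant>
"""


def country_number_pairs(markup):
    """Pair each doc-number with the country that shares its parent tag."""
    soup = BeautifulSoup(markup, "html.parser")
    pairs = []
    for number in soup.find_all("doc-number"):
        # Only look at direct children of the same parent element.
        country = number.parent.find("country", recursive=False)
        country_text = country.get_text(strip=True) if country is not None else ""
        pairs.append((country_text, number.get_text(strip=True)))
    return pairs


def to_csv_text(pairs):
    """Render the pairs as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["country", "doc_number"])
    writer.writerows(pairs)
    return buffer.getvalue()


pairs = country_number_pairs(SAMPLE)
print(pairs)
print(to_csv_text(pairs))
```

To write a real file instead of an in-memory string, the same writer can be pointed at a file opened with newline='' as the csv module documentation recommends.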

Upvotes: 2
