HelloToEarth

Reputation: 2117

Python skipping XML child nodes while parsing in BeautifulSoup

I'm running into a problem where one group of child tags isn't being parsed fully. I use the same logic for other tags without trouble, but for these it skips everything except the first entry, and I can't see why. A basic snippet of my current script looks something like this:

import csv
from bs4 import BeautifulSoup

# separated_xml() and infile are defined elsewhere in the script
for xml_string in separated_xml(infile):

    soup = BeautifulSoup(xml_string, "lxml")
    us_grant = soup.findAll("us-patent-grant")

    with open('./output.csv', 'ab+') as f:
        writer = csv.writer(f, dialect='excel')

        for info in us_grant:
            data = []
            us_class_search = soup.findAll("us-field-of-classification-search")

            for item2 in us_class_search:
                if item2.find("classification-national"):
                    country_search = item2.find("country")
                    main_class_search = item2.find("main-classification")

                    data.append(country_search.text)
                    data.append(main_class_search.text)

            print(data)

When I print(data) after each iteration, it only gives me the first country and main-classification entry under every us-patent-grant parent, even though there are many more. For example, the XML for these tags looks like:

<us-field-of-classification-search>
<classification-national>
<country>US</country>
<main-classification>D 1100-130</main-classification>
<additional-info>unstructured</additional-info>
</classification-national>
<classification-national>
<country>US</country>
<main-classification>D 1199</main-classification>
</classification-national>
<classification-national>
<country>US</country>
<main-classification>426  5</main-classification>
</classification-national>
<classification-national>
<country>US</country>
<main-classification>426 76</main-classification>
</classification-national>
</us-field-of-classification-search>

The if item2 check doesn't seem to be the cause: even if I take it out, the for item2 loop behaves the same way, finding and appending only the first tags under each parent. So it must be the loop itself that fails to visit each instance and treats the first tag as if it were all of them.

Any ideas? I don't see any obvious fault in logic.

If you'd like to see the XML itself you can find it here on USPTO

Upvotes: 1

Views: 274

Answers (2)

HelloToEarth

Reputation: 2117

Simple fix: introduce an inner for loop over the classification-national elements:

for items in item2.findAll("classification-national"):

Calling find() on items inside that loop let me reach every entry inside the parent.
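
For anyone landing here later, here is a minimal sketch of what the corrected loop might look like end to end. The helper name classification_pairs and the inline SAMPLE string are made up for illustration, and it uses the stdlib "html.parser" backend instead of "lxml" only so that it runs without extra dependencies; the original script's CSV writing is left out.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the XML from the question, wrapped in a us-patent-grant parent.
SAMPLE = """
<us-patent-grant>
<us-field-of-classification-search>
<classification-national>
<country>US</country>
<main-classification>D 1100-130</main-classification>
<additional-info>unstructured</additional-info>
</classification-national>
<classification-national>
<country>US</country>
<main-classification>D 1199</main-classification>
</classification-national>
<classification-national>
<country>US</country>
<main-classification>426 76</main-classification>
</classification-national>
</us-field-of-classification-search>
</us-patent-grant>
"""


def classification_pairs(xml_string, parser="html.parser"):
    """Hypothetical helper: collect (country, main-classification) pairs."""
    soup = BeautifulSoup(xml_string, parser)
    pairs = []
    for search in soup.find_all("us-field-of-classification-search"):
        # Inner loop: visit every classification-national, not just the first.
        for national in search.find_all("classification-national"):
            country = national.find("country")
            main_class = national.find("main-classification")
            # Skip entries that lack either tag instead of failing on .text.
            if country is not None and main_class is not None:
                pairs.append((country.text, main_class.text))
    return pairs


pairs = classification_pairs(SAMPLE)
print(pairs)
```

Keeping each country next to its main-classification as a tuple also makes it easier to write one CSV row per classification later, if that is the goal.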

Upvotes: 0

Jim Garrison

Reputation: 86764

item2 is iterating over all <us-field-of-classification-search> elements, of which there is only one in your sample.

Within that loop you should then be iterating over <classification-national> elements, but you are examining only the first one.
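
To make the difference concrete, here is a small sketch that contrasts find() with find_all() on a cut-down version of the sample. The variable names are illustrative, and "html.parser" is used here only to avoid the lxml dependency; the behaviour of find() versus find_all() should be the same with the "lxml" parser from the question.

```python
from bs4 import BeautifulSoup

# Two classification-national entries under a single search element.
xml = (
    "<us-field-of-classification-search>"
    "<classification-national><country>US</country>"
    "<main-classification>D 1199</main-classification></classification-national>"
    "<classification-national><country>US</country>"
    "<main-classification>426 76</main-classification></classification-national>"
    "</us-field-of-classification-search>"
)

soup = BeautifulSoup(xml, "html.parser")
search = soup.find("us-field-of-classification-search")

# find() searches all descendants and returns only the first match,
# which is why the original loop kept seeing the first entry.
first_only = search.find("main-classification").text

# find_all() returns every matching descendant.
all_matches = [tag.text for tag in search.find_all("main-classification")]

print(first_only)
print(all_matches)
```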

Upvotes: 1
