Mag

Reputation: 11

BeautifulSoup XML parsing failure: find_all returns only one result

I am trying to parse all dataset IDs from an XML file using BeautifulSoup. My script:

soup = BeautifulSoup(source, "lxml")

doc = soup.find_all('doc')
string = doc.find('str', attrs={"name":"id"})

When I run it to get the string for every doc, I get this error:

AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?

I tried a for loop as follows, with two different ways of getting the string (tried separately):

for doc in soup.find_all('doc'):
    string = doc.find_all('str', attrs={"name":"id"})
    # OR
    string = doc.str

But it returns only one result, the first one.

Here is the XML text I want to parse (the doc tag is repeated several times):

<doc>
    <str name="id"></str>
    <str name="version">20110601</str>
    <arr name="access"></arr>
    <arr name="cf_standard_name"></arr><arr name="cmor_table">
    <str name="instance_id"></str>
</doc>

Upvotes: 1

Views: 400

Answers (1)

Andrej Kesely

Reputation: 195468

In your example, soup.find_all('doc') returns all matching elements as a ResultSet, which behaves like a list. A ResultSet has no find() method, so you need to iterate over it and call find() on each element to get the information you want.
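
To illustrate the iteration described above, here is a minimal sketch. The sample string is hypothetical data shaped like the XML in the question, and it uses Python's built-in "html.parser" only so that it runs without lxml installed; that parser choice is an assumption, not something from the original script. Each value is appended to a list, so every match is kept rather than only the last value assigned to a variable.

from bs4 import BeautifulSoup

# Hypothetical sample data, shaped like the XML in the question.
sample = """
<doc><str name="id">a</str><str name="version">1</str></doc>
<doc><str name="id">b</str><str name="version">2</str></doc>
"""

# Assumption: the built-in parser is used here to avoid the lxml dependency.
soup = BeautifulSoup(sample, "html.parser")

ids = []
for doc in soup.find_all("doc"):
    # find() is called on a single Tag here, not on the ResultSet
    tag = doc.find("str", attrs={"name": "id"})
    if tag is not None:  # skip any <doc> that has no id element
        ids.append(tag.get_text())

print(ids)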

If you want to extract specific tags inside the <doc> tag, there are several ways to do it. I use CSS selectors: for example, soup.select('doc str[name="id"]') selects all <str> tags with the attribute name="id" that are inside a <doc> tag:

data = """<doc>
<str name="id">1</str>
<str name="version">20110601</str>
<arr name="access"></arr>
<arr name="cf_standard_name"></arr><arr name="cmor_table">
<str name="instance_id"></str>
</doc>

<doc>
<str name="id">2</str>
<str name="version">20110602</str>
<arr name="access"></arr>
<arr name="cf_standard_name"></arr><arr name="cmor_table">
<str name="instance_id"></str>
</doc>

<doc>
<str name="id">3</str>
<str name="version">20110603</str>
<arr name="access"></arr>
<arr name="cf_standard_name"></arr><arr name="cmor_table">
<str name="instance_id"></str>
</doc>
"""

from bs4 import BeautifulSoup
from pprint import pprint

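# 'lxml' is a lenient HTML parser, so the unclosed <arr name="cmor_table"> tags are tolerated
soup = BeautifulSoup(data, 'lxml')

# One list per field, each in document order
all_ids = [tag.text for tag in soup.select('doc str[name="id"]')]
all_versions = [tag.text for tag in soup.select('doc str[name="version"]')]

# Pair the n-th id with the n-th version
pprint([*zip(all_ids, all_versions)])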

This example prints:

[('1', '20110601'), ('2', '20110602'), ('3', '20110603')]
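
Pairing two separately selected lists with zip assumes every <doc> contains both fields. If that may not hold for your data, a possible variation is to look up both fields inside each <doc>. This is only a sketch: the sample data and the helper text_or_none are hypothetical, and "html.parser" is used as an assumption so the snippet runs without lxml.

from bs4 import BeautifulSoup

# Hypothetical sample: the second <doc> has no version element.
sample = """
<doc><str name="id">1</str><str name="version">20110601</str></doc>
<doc><str name="id">2</str></doc>
<doc><str name="id">3</str><str name="version">20110603</str></doc>
"""

soup = BeautifulSoup(sample, "html.parser")

def text_or_none(doc, name):
    """Return the text of <str name=...> inside doc, or None if it is absent."""
    tag = doc.select_one(f'str[name="{name}"]')
    return tag.get_text() if tag is not None else None

# One (id, version) tuple per <doc>, so the fields cannot drift out of step
records = [
    (text_or_none(doc, "id"), text_or_none(doc, "version"))
    for doc in soup.select("doc")
]

print(records)

A missing field shows up as None in its own record instead of shifting the values of the following records.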

Upvotes: 1
