Reputation: 11
I am trying to parse all dataset IDs from an XML file using BeautifulSoup. My script:
soup = BeautifulSoup(source, "lxml")
doc = soup.find_all('doc')
string = doc.find('str', attrs={"name":"id"})
When I run it to get the string for every doc, I get this error:
AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
I tried a for loop as follows, with two alternative ways of getting the string:
for doc in soup.find_all('doc'):
    string = doc.find_all('str', attrs={"name":"id"})
OR
    string = doc.str
But it returns only one result, the first one.
Here is the XML text I want to parse (the doc tag is repeated several times):
<doc>
<str name="id"></str>
<str name="version">20110601</str>
<arr name="access"></arr>
<arr name="cf_standard_name"></arr><arr name="cmor_table">
<str name="instance_id"></str>
</doc>
Upvotes: 1
Views: 400
Reputation: 195468
In your example, soup.find_all('doc') returns all matching elements as a list (a ResultSet), not a single tag. You need to iterate over this list to find the information you want.
If you want to extract specific tags inside each <doc> tag, you can do it in various ways. I use CSS selectors, e.g. soup.select('doc str[name="id"]') will select all <str> tags with the attribute name="id" that are inside a <doc> tag:
data = """<doc>
<str name="id">1</str>
<str name="version">20110601</str>
<arr name="access"></arr>
<arr name="cf_standard_name"></arr><arr name="cmor_table">
<str name="instance_id"></str>
</doc>
<doc>
<str name="id">2</str>
<str name="version">20110602</str>
<arr name="access"></arr>
<arr name="cf_standard_name"></arr><arr name="cmor_table">
<str name="instance_id"></str>
</doc>
<doc>
<str name="id">3</str>
<str name="version">20110603</str>
<arr name="access"></arr>
<arr name="cf_standard_name"></arr><arr name="cmor_table">
<str name="instance_id"></str>
</doc>
"""
from bs4 import BeautifulSoup
from pprint import pprint
soup = BeautifulSoup(data, 'lxml')
all_ids = [tag.text for tag in soup.select('doc str[name="id"]')]
all_versions = [tag.text for tag in soup.select('doc str[name="version"]')]
pprint([*zip(all_ids, all_versions)])
This example prints:
[('1', '20110601'), ('2', '20110602'), ('3', '20110603')]
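If some <doc> elements might be missing a field, zipping two separate selections can pair values from different documents. A minimal sketch of the per-document loop mentioned above, which keeps each ID with its own version, could look like the following. The sample data and the names records, id_tag and version_tag are made up for illustration, and html.parser is used only so the sketch runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample data shaped like the XML in the question;
# the second <doc> deliberately has no "version" entry.
data = """<doc>
<str name="id">1</str>
<str name="version">20110601</str>
</doc>
<doc>
<str name="id">2</str>
</doc>"""

# Assumption: html.parser (bundled with Python) is enough for this small sample.
soup = BeautifulSoup(data, "html.parser")

records = []
for doc in soup.find_all("doc"):
    # find() returns a single Tag or None, so it is safe to call on each doc
    id_tag = doc.find("str", attrs={"name": "id"})
    version_tag = doc.find("str", attrs={"name": "version"})
    # Append inside the loop so earlier results are not overwritten
    records.append((
        id_tag.get_text() if id_tag else None,
        version_tag.get_text() if version_tag else None,
    ))

print(records)
```

Appending to a list inside the loop is also what avoids ending up with a single value: assigning to one variable on each iteration keeps only the last assignment.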
Upvotes: 1