Reputation: 269
I'm trying to extract text from the HTML below:
<div class="MYCLASS">Category1: <a id=category1 href="SomeURL" >
Text1 I want</a> > Category2: <a href="SomeURL" >Text2 I want</a></div>
I tried:
for div in soup.find_all('div', class_='MYCLASS'):
    for url in soup.find_all('a', id='category1'):
        print(url)
And it returned:
<a href="someURL" id="category1">Text1 I want</a>
So I split the text...
for div in soup.find_all('div', class_='MYCLASS'):
    for url in soup.find_all('a', id='category1'):
        category1 = str(url).split('category1">')[1].split('</a>')[0]
        print(category1)
and extracted "Text1 I want", but I'm still missing "Text2 I want". Any ideas? Thank you.
EDIT:
There are other <a>...</a> tags in the source code, so if I remove id= from my code,
it returns all the other text that I don't need. For example:
<div class="MYClass"><span class="Class">RandomText.<br>RandomText.<br>
<a href=someURL>RandomTextExtracted.</a><br>
Also,
</div><div class=MYClass>
<a href="SomeURL>RandomTextExtracted</a>
Upvotes: 0
Views: 106
Reputation: 7248
Since the id of an element is unique, you can find the first <a> tag using id="category1". To find the next <a> tag, you can use the find_next() method.
from bs4 import BeautifulSoup

html = '''<div class="MYCLASS">Category1: <a id=category1 href="SomeURL" >Text1 I want</a> > Category2: <a href="SomeURL" >Text2 I want</a></div>'''
soup = BeautifulSoup(html, 'lxml')
a_tag1 = soup.find('a', id='category1')
print(a_tag1)  # or use `a_tag1.text` to get the text
a_tag2 = a_tag1.find_next('a')  # next <a> in document order after a_tag1
print(a_tag2)
Output:
<a href="SomeURL" id="category1">Text1 I want</a>
<a href="SomeURL">Text2 I want</a>
(I've tested it for the link you've provided, and it works there too.)
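One caveat: find_next() walks forward through the whole document, so it can run into a link in a later div. Here is a minimal sketch for the case described in the question's EDIT, where other divs with the same class hold links you don't want. It finds the anchor by id, climbs to its enclosing div with find_parent(), and collects only the links inside that div. The sample markup and the names `container` and `wanted` are my own assumptions, and I've used the built-in 'html.parser' so the snippet doesn't depend on lxml.

```python
from bs4 import BeautifulSoup

# Assumed sample: the target div plus a second div with an unrelated link
html = '''
<div class="MYCLASS">Category1: <a id="category1" href="SomeURL">
Text1 I want</a> > Category2: <a href="SomeURL">Text2 I want</a></div>
<div class="MYCLASS"><a href="SomeURL">RandomTextExtracted</a></div>
'''

soup = BeautifulSoup(html, 'html.parser')

# Locate the uniquely identified link, then its enclosing div
first = soup.find('a', id='category1')
container = first.find_parent('div', class_='MYCLASS')

# Only links inside that one div are collected
wanted = [a.get_text(strip=True) for a in container.find_all('a')]
print(wanted)
```

Because the search is limited to `container`, the link in the second div is never visited.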
Upvotes: 1
Reputation: 17408
You need to change your code a little:
from bs4 import BeautifulSoup

html = '''<div class="MYCLASS">Category1: <a id=category1 href="SomeURL" >
Text1 I want</a> > Category2: <a href="SomeURL" >Text2 I want</a></div>'''
soup = BeautifulSoup(html, 'lxml')

for div in soup.find_all('div', class_='MYCLASS'):
    # Search inside each matching div rather than the whole document
    for url in div.find_all('a'):
        print(url.text.strip())
That is, remove the id filter from the 'a' lookup and run the same code.
If you need the text of specific ids, you need to know those ids in advance.
ids = ['id1', 'id2']  # placeholders: replace with the ids you need
for div in soup.find_all('div', class_='MYCLASS'):
    for wanted_id in ids:
        for url in div.find_all('a', id=wanted_id):
            print(url.text.strip())
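As a possible shortcut for the id-based lookup, find_all() also accepts a list as an attribute filter and matches any value in it, so the loop over ids can be collapsed into one call. This is only a sketch: the id "category2" is hypothetical (the second link in the question has no id), and `wanted_ids` and `texts` are names I made up for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup; the id "category2" is hypothetical, added for illustration
html = '''
<div class="MYCLASS">Category1: <a id="category1" href="SomeURL">Text1 I want</a>
> Category2: <a id="category2" href="SomeURL">Text2 I want</a>
<a href="SomeURL">RandomTextExtracted</a></div>
'''

soup = BeautifulSoup(html, 'html.parser')

# A list filter matches any <a> whose id equals one of the listed values
wanted_ids = ['category1', 'category2']
texts = [a.get_text(strip=True) for a in soup.find_all('a', id=wanted_ids)]
print(texts)
```

Links without a matching id, like the third one here, are skipped.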
Upvotes: 0