Reputation: 41
I'm trying to write a Python 3 web scraper that extracts the text inside <a> tags from a site.
I'm using the bs4 library with this code:
from bs4 import BeautifulSoup
import requests
req = requests.get(mainUrl).text
soup = BeautifulSoup(req, 'html.parser')
for div in soup.find_all('div', 'turbolink_scroller'):
    for a in div.find_all('a', href=True, text=True):
        print(a.text)
The only problem is that it only finds text in markup like this:
<div class="test">
<a href="/link/to/whatIwant">Text that i want</a>
</div>
but not with this one:
<div class="test">
<a href="/link/to/whatIwant2">
The text
<br>
I would like
</a>
</div>
Could you explain why, and what the difference between the two is?
Upvotes: 0
Views: 43
Reputation: 20052
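It has to do with the <br> tag inside the second <a>. With text=True, find_all only matches a tag whose .string is not None, and .string is None as soon as a tag has more than one child (here: a string, the <br>, and another string). If you remove text=True you'll get both of them.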
from bs4 import BeautifulSoup
sample = """
<div class="test">
<a href="/link/to/whatIwant">Text that i want</a>
</div>
<div class="test">
<a href="/link/to/whatIwant2">
The text
<br>
I would like
</a>
</div>
"""
for div in BeautifulSoup(sample, 'html.parser').find_all('div', 'test'):
    for a in div.find_all('a', href=True):
        print(a.getText(strip=True))
Output:
Text that i want
The textI would like
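To see the difference between the two links directly, here is a minimal sketch that inspects .string on each one and then uses get_text() with a separator so the two lines don't run together. The shortened hrefs and the variable names (first, second, joined) are only for illustration and aren't from the question.

```python
from bs4 import BeautifulSoup

# Illustrative markup: the same two kinds of <a> as in the question.
html = """
<a href="/one">Text that i want</a>
<a href="/two">
The text
<br>
I would like
</a>
"""

soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("a", href=True)

# The first <a> has exactly one child (a string), so .string returns it.
first_string = first.string

# The second <a> has three children (string, <br>, string), so .string
# is None. That is why a text=True filter skips this tag.
second_string = second.string

# get_text() collects every nested string; the separator keeps the
# pieces around the <br> apart.
joined = second.get_text(separator=" ", strip=True)

print(first_string)   # Text that i want
print(second_string)  # None
print(joined)         # The text I would like
```

So the filter isn't needed at all if you just want the visible text: drop text=True and call get_text() on each link.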
Upvotes: 1