CyberJugger

Reputation: 41

Python3 web-scraper can't extract text from every <a> tag in the site

I'm trying to write a Python 3 web scraper that extracts the text inside the <a> tags of a site.

I'm using the bs4 library with this code:

from bs4 import BeautifulSoup
import requests

req = requests.get(mainUrl).text
soup = BeautifulSoup(req, 'html.parser')
for div in soup.find_all('div', 'turbolink_scroller'):
    for a in div.find_all('a', href=True, text=True):
        print(a.text)

The only problem is that it only finds the text when the markup looks like this:

<div class="test">
 <a href="/link/to/whatIwant">Text that i want</a>
</div>

but not with this one:

<div class="test">
  <a href="/link/to/whatIwant2">
    The text
    <br>
    I would like
  </a>
</div>

Could you explain why, and what the difference between the two is?

Upvotes: 0

Views: 43

Answers (1)

baduker

Reputation: 20052

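It has to do with the <br> tag inside the second <a>. With text=True, find_all only keeps tags whose .string is not None, and .string is None whenever a tag has more than one child, as the second <a> does (text, <br>, text). If you remove text=True you'll get both of them.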

from bs4 import BeautifulSoup

sample = """
<div class="test">
 <a href="/link/to/whatIwant">Text that i want</a>
</div>
<div class="test">
  <a href="/link/to/whatIwant2">
    The text
    <br>
    I would like
  </a>
</div>
"""

for div in BeautifulSoup(sample, 'html.parser').find_all('div', 'test'):
    for a in div.find_all('a', href=True):
        print(a.getText(strip=True))

Output:

Text that i want
The textI would like
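
To see the difference between the two <a> tags directly, here is a small sketch that reuses the sample markup above. The variable names (links, strings, texts, filtered) are only illustrative, and it uses string=True, which is the newer spelling of text=True in bs4 4.4.0 and later. It also passes a separator to get_text so the two pieces of text don't run together.

```python
from bs4 import BeautifulSoup

sample = """
<div class="test">
 <a href="/link/to/whatIwant">Text that i want</a>
</div>
<div class="test">
  <a href="/link/to/whatIwant2">
    The text
    <br>
    I would like
  </a>
</div>
"""

soup = BeautifulSoup(sample, 'html.parser')
links = soup.find_all('a', href=True)

# .string is only set when a tag has a single child;
# the second <a> has three children (text, <br>, text), so it is None
strings = [a.string for a in links]

# get_text() collects every text node; the separator keeps the pieces apart
texts = [a.get_text(separator=' ', strip=True) for a in links]

# string=True (formerly text=True) keeps only tags whose .string is not None
filtered = [a['href'] for a in soup.find_all('a', href=True, string=True)]

print(strings)
print(texts)
print(filtered)
```

Here the second entry of strings should be None, texts should contain "The text I would like" with a space, and filtered should hold only the first link, which is the behaviour described in the question.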

Upvotes: 1
