Python3 web-scraper can't extract text from every <a> tag in the site

Question

I'm trying to write a Python 3 web scraper that extracts the text inside <a> tags from a site.

I'm using the bs4 library with this code:

from bs4 import BeautifulSoup
import requests

req = requests.get(mainUrl).text
soup = BeautifulSoup(req, 'html.parser')
for div in soup.find_all('div', 'turbolink_scroller'):
    for a in div.find_all('a', href=True, text=True):
        print(a.text)

The only problem is that it only finds the text in markup like this:


 Text that i want

but not with this one:


  
    The text
    

    I would like

Could you explain why, and what the difference between the two is?

baduker · Accepted Answer

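It most likely has to do with the extra tag nested inside the second link. With text=True, find_all only matches tags whose .string is not None, and .string is None as soon as a tag has more than one child (here, two pieces of text separated by another tag). If you remove text=True, you'll get both of them.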

from bs4 import BeautifulSoup

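# Sample markup; the href values and the <br> are placeholders.
sample = """
<div class="test">
 <a href="#">Text that i want</a>
</div>
<div class="test">
  <a href="#">
    The text
    <br>
    I would like
  </a>
</div>
"""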

for div in BeautifulSoup(sample, 'html.parser').find_all('div', 'test'):
    for a in div.find_all('a', href=True):
        print(a.getText(strip=True))

Output:

Text that i want
The textI would like
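
To see why text=True skips the second link, it may help to look at .string directly. The snippet below is only a sketch: the markup is made up for illustration (the href values and the <br> are assumptions, not the asker's real page). It also shows get_text with a separator, which is one way to avoid the two fragments being glued together as in "The textI would like".

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the href values and the <br> are placeholders.
html = """
<div class="test"><a href="#one">Text that i want</a></div>
<div class="test"><a href="#two">The text<br>I would like</a></div>
"""

soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("a", href=True)

# .string is only set when a tag has exactly one child string;
# the second link has text, a <br>, and more text, so it is None there.
string_is_none = [a.string is None for a in links]
print(string_is_none)  # [False, True]

# text=True keeps only the tags whose .string is not None.
filtered = [a.get_text() for a in soup.find_all("a", href=True, text=True)]
print(filtered)  # ['Text that i want']

# A separator keeps the two text fragments apart when joining them.
joined = [a.get_text(separator=" ", strip=True) for a in links]
print(joined)  # ['Text that i want', 'The text I would like']
```

Recent Beautiful Soup releases call this argument string rather than text; the older name is kept here to match the question.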
