Reputation: 137
I'm trying to scrape a site using Python with BeautifulSoup, but the site takes a long time to load; the scraping runs fast and doesn't retrieve the complete page. I would like to know how to wait 5 seconds before retrieving the source code with BeautifulSoup.
My code looks like this:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import ssl

# skip certificate verification
ssl._create_default_https_context = ssl._create_unverified_context

url = 'https://www.edocente.com.br/pnld/2020/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'}

# fetch the page and parse it
req = Request(url, headers=headers)
response = urlopen(req)
html = response.read()

soup = BeautifulSoup(html, 'html.parser')
soup.findAll('a', class_="btn bold mt-4 px-5")
I can't retrieve the whole source code because the site is slow to load, so my tags aren't in the result. How can I wait for the whole source code of the site?
I would like to get only the href values of the tags, as below:
<a href="/pnld/2020/obra/companhia-das-ciencias-6-ano-saraiva" class="btn bold mt-4 px-5">Ver Obra </a>
<a href="/pnld/2020/obra/companhia-das-ciencias-7-ano-saraiva" class="btn bold mt-4 px-5">Ver Obra </a>
<a href="/pnld/2020/obra/companhia-das-ciencias-8-ano-saraiva" class="btn bold mt-4 px-5">Ver Obra </a>
I'd like to recover:
/pnld/2020/obra/companhia-das-ciencias-6-ano-saraiva
/pnld/2020/obra/companhia-das-ciencias-7-ano-saraiva
/pnld/2020/obra/companhia-das-ciencias-8-ano-saraiva
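Getting the attribute out of the tags seems simple enough once they are really in the HTML; something like this (untested sketch, reusing the soup from above) is what I have in mind:
# hypothetical extraction step, assuming the anchors made it into the soup
for a in soup.findAll('a', class_="btn bold mt-4 px-5"):
    print(a.get('href'))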
How can I do this? Thanks.
Upvotes: 3
Views: 4582
Reputation: 1602
You could try to get the page asynchronously using aiohttp and asyncio. For example, pass that url to query and the headers as parameters to the ClientSession instance; you then get a ClientResponse object called response, and you can get all the information you need from the response.
Install the modules with: pip install cchardet aiodns aiohttp[speedups]
import aiohttp
import asyncio
from bs4 import BeautifulSoup

url = 'https://www.edocente.com.br/pnld/2020/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'}

async def main():
    async with aiohttp.ClientSession() as session:
        # ssl=False skips certificate verification, like the
        # _create_unverified_context workaround in the question
        async with session.get(url, headers=headers, ssl=False) as response:
            print("Status:", response.status)
            print("Content-type:", response.headers['content-type'])
            html = await response.text()
            soup = BeautifulSoup(html, 'html.parser')
            print(soup.findAll('a', class_="btn bold mt-4 px-5"))

loop = asyncio.get_event_loop()
loop.run_until_complete(main())
output:
Status: 200
Content-type: text/html; charset=UTF-8
[<a :href="linkPrefixo + obra.tituloSeo" class="btn bold mt-4 px-5">Ver {{ (current_edition=='2021-objeto-2') ? 'Coleção' : 'Obra' }} </a>, <a class="btn bold mt-4 px-5">AGUARDE</a>, <a class="btn bold mt-4 px-5">AGUARDE</a>, <a class="btn bold mt-4 px-5">AGUARDE</a>]
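Note that the anchors come back with a Vue template binding (:href="linkPrefixo + obra.tituloSeo") and AGUARDE placeholders instead of real links, which suggests the hrefs are filled in client-side by JavaScript, so a plain HTTP fetch, even an asynchronous one, only sees the unrendered template.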
Upvotes: 1
Reputation: 1875
I guess the site at this URL (https://www.edocente.com.br/pnld/2020/) is a dynamic website, meaning that you can't load it with requests or urllib.
For loading dynamic websites and then passing them to Beautiful Soup, you need a browser to load the website in the background. There are libraries for doing that.
Here is a snippet to load dynamic websites:
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

def get_dynamic_soup(url: str) -> BeautifulSoup:
    # launch a headless Chromium, render the page, then parse the final HTML
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        soup = BeautifulSoup(page.content(), "html.parser")
        browser.close()
        return soup
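For the use case in the question, a minimal usage sketch could look like this (get_dynamic_soup_waiting is a hypothetical variant I'm adding here; the URL and selector come from the question, and wait_for_selector is one way to cover the "wait until the page has loaded" part instead of a fixed 5-second sleep):
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

def get_dynamic_soup_waiting(url: str, selector: str) -> BeautifulSoup:
    # hypothetical variant that blocks until the target elements are rendered
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        page.wait_for_selector(selector)  # wait for at least one match to appear
        soup = BeautifulSoup(page.content(), "html.parser")
        browser.close()
        return soup

soup = get_dynamic_soup_waiting("https://www.edocente.com.br/pnld/2020/", "a.btn.bold.mt-4.px-5")
print([a.get("href") for a in soup.select("a.btn.bold.mt-4.px-5")])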
Install the Python package:
pip install playwright
Then install the Chromium browser (in your terminal):
playwright install
And you are ready to scrape dynamic websites.
Upvotes: 2