Reputation: 49
I want to download some specific files from this page.
This exact four files:
How I am supposed to iterate through the page using selenium in a way I can maintain a good programming practices.
Is there a library better than selenium to do it?
I really just need some clarifying ideas.
Upvotes: 2
Views: 813
Reputation: 1076
Selenium is not lightweight, it is the last resort. It mimics the browser, so things like event handling (clicking some element, captcha submission, etc.). Also, if you're trying to scrape a page that uses JavaScript ( dynamically generated data that can not be found when you check the source code of the webpage), Selenium can be a good choice.
For any web scraping project, first, search your desired texts in the source code of the web page (press Ctrl+U
when you visit the page). If the desired element (texts/links etc.) can be found in the source code then you don't need to use a heavyweight library like selenium.
For this case, the texts you're trying to parse can be found in the source code.
so you can use requests
library and a simple parser library like bs4
For the selectors, check this page - W3Schools Css Selectors
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36'}
url = 'https://www.gov.br/ans/pt-br/assuntos/consumidor/o-que-o-seu-plano-de-saude-deve-cobrir-1/o-que-e-o-rol-de-procedimentos-e-evento-em-saude'
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
paragraphs = soup.select('div#parent-fieldname-text > h3 ~ p') # select all p element which comes after h3 tag inside div with "parent-fieldname-text" id
Output -
[<p class="callout"><a class="alert-link external-link" href="http://www.ans.gov.br/component/legislacao/?view=legislacao&task=TextoLei&format=raw&id=NDAzMw==" target="_blank">Resolução Normativa n°465/2021</a></p>,
<p class="callout"><a class="internal-link" data-tippreview-enabled="true" data-tippreview-image="" data-tippreview-title="" href="https://www.gov.br/ans/pt-br/arquivos/assuntos/consumidor/o-que-seu-plano-deve-cobrir/Anexo_I_Rol_2021RN_465.2021_RN473_RN478_RN480_RN513_RN536.pdf" target="_self" title="">Anexo I - Lista completa de procedimentos (.pdf)</a></p>,
<p class="callout"><a class="internal-link" data-tippreview-enabled="true" data-tippreview-image="" data-tippreview-title="" href="https://www.gov.br/ans/pt-br/arquivos/assuntos/consumidor/o-que-seu-plano-deve-cobrir/Anexo_I_Rol_2021RN_465.2021_RN473_RN478_RN480_RN513_RN536.xlsx" target="_self" title="">Anexo I - Lista completa de procedimentos (.xlsx)</a></p>,
<p class="callout"><a class="internal-link" data-tippreview-enabled="true" data-tippreview-image="" data-tippreview-title="" href="https://www.gov.br/ans/pt-br/arquivos/assuntos/consumidor/o-que-seu-plano-deve-cobrir/Anexo_II_DUT_2021_RN_465.2021_tea.br_RN473_RN477_RN478_RN480_RN513_RN536.pdf" target="_self" title="">Anexo II - Diretrizes de utilização (.pdf)</a></p>,
<p class="callout"><a class="internal-link" data-tippreview-enabled="false" data-tippreview-image="" data-tippreview-title="" href="https://www.gov.br/ans/pt-br/arquivos/assuntos/consumidor/o-que-seu-plano-deve-cobrir/Anexo_III_DC_2021_RN_465.2021.v2.pdf" target="_self" title="">Anexo III - Diretrizes clínicas (.pdf)</a></p>,
<p class="callout"><a class="internal-link" data-tippreview-enabled="false" data-tippreview-image="" data-tippreview-title="" href="https://www.gov.br/ans/pt-br/arquivos/assuntos/consumidor/o-que-seu-plano-deve-cobrir/Anexo_IV_PROUT_2021_RN_465.2021.v2.pdf" target="_self" title="">Anexo IV - Protocolo de utilização (.pdf)</a></p>,
<p class="callout"><a class="alert-link internal-link" data-tippreview-enabled="true" data-tippreview-image="" data-tippreview-title="" href="https://www.gov.br/ans/pt-br/arquivos/assuntos/consumidor/o-que-seu-plano-deve-cobrir/nota13_geas_ggras_dipro_17012013.pdf" target="_blank" title="">Nota sobre as Terminologias</a><br/> Rol de Procedimentos e Eventos em Saúde, Terminologia Unificada da Saúde Suplementar - TUSS e Classificação Brasileira Hierarquizada de Procedimentos Médicos - CBHPM</p>,
<p class="callout"><a class="internal-link" data-tippreview-enabled="true" data-tippreview-image="" data-tippreview-title="" href="https://www.gov.br/ans/pt-br/arquivos/assuntos/consumidor/o-que-seu-plano-deve-cobrir/CorrelacaoTUSS.2021Rol.2021_RN478_RN480_RN513_FU_RN536_20220506.xlsx" target="_self" title="">Correlação TUSS X Rol<br/></a> Correlação entre o Rol de Procedimentos e Eventos em Saúde e a Terminologia Unificada da Saúde Suplementar – TUSS</p>]
You're looking for the first few elements from this output list
to download all files use :
for p in paragraphs:
r = requests.get(p.get_text(), allow_redirects=True)
open(p.get_text(), 'wb').write(r.content)
Upvotes: 2