Arthur Carvalho Brito
Arthur Carvalho Brito

Reputation: 560

Extracting a list of links from a webpage by using its class

I am trying to extract from this website a list of four links that are clearly named as:

PNADC_012018_20190729.zip

PNADC_022018_20190729.zip

PNADC_032018_20190729.zip

PNADC_042018_20190729.zip

I've seen that they are all part of a class called 'jstree-wholerow'. I'm not really good at scraping, yet I've tried to capture such links using this regularity:

x <- rvest::read_html('https://www.ibge.gov.br/estatisticas/downloads-estatisticas.html?caminho=Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2018') %>%
  rvest::html_nodes("jstree-wholerow") %>%
  rvest::html_text()

However, I received an empty vector as output.

Can someone help fixing this?

Upvotes: 1

Views: 112

Answers (1)

jpdugo17
jpdugo17

Reputation: 7106

Although the webpage uses javascript, the files are stored in a ftp. It also has very predictable directory names.

library(tidyverse)
library(stringr)
library(rvest)
#> 
#> Attaching package: 'rvest'
#> The following object is masked from 'package:readr':
#> 
#>     guess_encoding
library(RCurl)
#> 
#> Attaching package: 'RCurl'
#> The following object is masked from 'package:tidyr':
#> 
#>     complete

link <- 'https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2018/PNADC_042018_20190729.zip'

zip_names <- c('PNADC_012018_20190729.zip', 'PNADC_022018_20190729.zip', 'PNADC_032018_20190729.zip', 'PNADC_042018_20190729.zip')

links <- str_replace(link, '/2018.*\\.zip$', str_c('/2018/', zip_names))

links
#> [1] "https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2018/PNADC_012018_20190729.zip"
#> [2] "https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2018/PNADC_022018_20190729.zip"
#> [3] "https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2018/PNADC_032018_20190729.zip"
#> [4] "https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2018/PNADC_042018_20190729.zip"


#option 2

links <-  RCurl::getURL(url = 'https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2018/') %>% read_html() %>% 
    html_nodes(xpath = '//td/a[@href]') %>% html_attr('href')

links <- links[-1]

links
#> [1] "PNADC_012018_20190729.zip" "PNADC_022018_20190729.zip"
#> [3] "PNADC_032018_20190729.zip" "PNADC_042018_20190729.zip"

str_c('https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2018/', links)
#> [1] "https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2018/PNADC_012018_20190729.zip"
#> [2] "https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2018/PNADC_022018_20190729.zip"
#> [3] "https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2018/PNADC_032018_20190729.zip"
#> [4] "https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2018/PNADC_042018_20190729.zip"

Created on 2021-06-11 by the reprex package (v2.0.0)

Upvotes: 3

Related Questions