Reputation: 560
I am trying to extract four links from this website. They are named:
PNADC_012018_20190729.zip
PNADC_022018_20190729.zip
PNADC_032018_20190729.zip
PNADC_042018_20190729.zip
I've seen that they all belong to a class called 'jstree-wholerow'. I'm not very experienced with scraping, but I tried to capture the links using that pattern:
x <- rvest::read_html('https://www.ibge.gov.br/estatisticas/downloads-estatisticas.html?caminho=Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2018') %>%
rvest::html_nodes("jstree-wholerow") %>%
rvest::html_text()
However, I received an empty vector as output.
Can someone help me fix this?
Upvotes: 1
Views: 112
Reputation: 7106
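The webpage builds its file tree with JavaScript, so the nodes you are looking for are most likely not in the HTML that read_html() downloads. The files themselves, however, are stored on an FTP server that is also reachable over HTTPS, and its directory and file names are very predictable. You can either build the links from the known file names or read them from the directory listing.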
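A side note on the selector in the question: in CSS selector syntax, "jstree-wholerow" without a leading dot matches an element with that tag name, not a class. The sketch below illustrates the difference on a small made-up HTML string (the string is an assumption for illustration, not the real page).

```r
library(rvest)

# Made-up HTML fragment standing in for a page that contains the class
page <- read_html('<div class="jstree-wholerow">x</div><p>y</p>')

# Without the dot this is a tag selector: it looks for <jstree-wholerow> elements
n_tag <- length(html_nodes(page, "jstree-wholerow"))

# With the dot it is a class selector: it matches the <div> above
n_class <- length(html_nodes(page, ".jstree-wholerow"))
```

Even with the corrected selector, the real page would most likely still return nothing, because the tree is built by JavaScript after the page loads. The approach below avoids that by going to the file server directly.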
library(tidyverse)
library(stringr)
library(rvest)
#>
#> Attaching package: 'rvest'
#> The following object is masked from 'package:readr':
#>
#> guess_encoding
library(RCurl)
#>
#> Attaching package: 'RCurl'
#> The following object is masked from 'package:tidyr':
#>
#> complete
# Option 1: build the URLs from the known file names
link <- 'https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2018/PNADC_042018_20190729.zip'
zip_names <- c('PNADC_012018_20190729.zip', 'PNADC_022018_20190729.zip', 'PNADC_032018_20190729.zip', 'PNADC_042018_20190729.zip')
links <- str_replace(link, '/2018.*\\.zip$', str_c('/2018/', zip_names))
links
#> [1] "https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2018/PNADC_012018_20190729.zip"
#> [2] "https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2018/PNADC_022018_20190729.zip"
#> [3] "https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2018/PNADC_032018_20190729.zip"
#> [4] "https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2018/PNADC_042018_20190729.zip"
# Option 2: read the file names from the directory listing
links <- RCurl::getURL(url = 'https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2018/') %>% read_html() %>%
html_nodes(xpath = '//td/a[@href]') %>% html_attr('href')
# Drop the first href, which is not a .zip file (presumably the parent-directory link)
links <- links[-1]
links
#> [1] "PNADC_012018_20190729.zip" "PNADC_022018_20190729.zip"
#> [3] "PNADC_032018_20190729.zip" "PNADC_042018_20190729.zip"
str_c('https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2018/', links)
#> [1] "https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2018/PNADC_012018_20190729.zip"
#> [2] "https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2018/PNADC_022018_20190729.zip"
#> [3] "https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2018/PNADC_032018_20190729.zip"
#> [4] "https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2018/PNADC_042018_20190729.zip"
Created on 2021-06-11 by the reprex package (v2.0.0)
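Dropping the first element by position in option 2 assumes the listing always starts with exactly one non-file link. A slightly more defensive sketch is to filter by the .zip extension instead. Here `build_zip_urls` is a hypothetical helper written in base R, and the "/parent/" entry is a made-up stand-in for whatever non-file link the listing contains.

```r
# Hypothetical helper (not part of the answer above): keep only hrefs that end
# in ".zip" and prefix each one with the directory URL.
build_zip_urls <- function(hrefs, base_url) {
  zips <- grep("\\.zip$", hrefs, value = TRUE)
  if (length(zips) == 0) return(character(0))
  # Strip any trailing slash first so the join never produces "//"
  paste0(sub("/+$", "", base_url), "/", zips)
}

base_url <- "https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2018/"

# Example input shaped like the hrefs scraped in option 2
hrefs <- c("/parent/", "PNADC_012018_20190729.zip", "PNADC_022018_20190729.zip")
zip_urls <- build_zip_urls(hrefs, base_url)
```

In practice, `hrefs` would be the vector returned by `html_attr('href')` before any element is dropped.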
Upvotes: 3