Il Forna

Reputation: 23

R - Importing and formatting multiple tables from various urls

I am a newbie in R, so probably there is already an answer to the following question, but I haven't found a solution matching the issue I am facing. I am trying to get tables from a number of webpages; there should be around 5200 of them. I have imported one in order to format it, but I need to automate the process to get them all. Here's the url:

  http://www.tbca.net.br/base-dados/int_composicao_estatistica.php?cod_produto=C0195C

I have tried to find out a way to get all the tables by doing:

  url <- paste0("http://www.tbca.net.br/base-dados/int_composicao_estatistica.php?cod_produto=", ., sep="" )

but I receive an error message according to which

  .

cannot be read. In any case, I also don't see how to automate the formatting process once I get the tables. Any hint?

Upvotes: 0

Views: 32

Answers (1)

DaveArmstrong

Reputation: 21937
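Before the scraping, note that the . placeholder only means something inside a magrittr pipe; on its own it is just an undefined object, which is why your paste0() call fails. Since paste0() is vectorized, you can build all of the urls at once from a character vector of product codes (a minimal sketch; the second code here is made up for illustration):

codes <- c("C0195C", "C0999Z")  # C0999Z is a hypothetical code, just for illustration
urls <- paste0("http://www.tbca.net.br/base-dados/int_composicao_estatistica.php?cod_produto=", codes)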

Here's how you would do it for one product:

url <- "http://www.tbca.net.br/base-dados/int_composicao_estatistica.php?cod_produto=C0195C"
h <- read_html(url)
tab <- html_table(h, fill=TRUE) %>% 
  as_tibble(.name_repair = "universal")
tab
# A tibble: 37 x 1
#    ...1$Componente $Unidades $`Valor por 100… $`Desvio padrão` $`Valor Mínimo` $`Valor Máximo` $`Número de dad…
#    <chr>           <chr>     <chr>            <chr>            <chr>           <chr>           <chr>           
#  1 Energia         kJ        578              -                -               -               -               
#  2 Energia         kcal      136              -                -               -               -               
#  3 Umidade         g         65,5             -                -               -               -               
#  4 Carboidrato to… g         33,3             -                -               -               -               
#  5 Carboidrato di… g         32,5             -                -               -               -               
#  6 Proteína        g         0,60             -                -               -               -               
#  7 Lipídios        g         0,26             -                -               -               -               
#  8 Fibra alimentar g         0,84             -                -               -               -               
#  9 Álcool          g         0,00             -                -               -               -               
# 10 Cinzas          g         0,39             -                -               -               -               
# … with 27 more rows, and 2 more variables: $Referências <chr>, $`Tipo de dados` <chr>
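If you would rather have a plain data frame than the nested tibble above, you can take the first element of the list that html_table() returns (the same thing the loop further down does):

tab1 <- html_table(h, fill=TRUE)[[1]]  # first (and here only) table on the page
head(tab1)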

If you wanted to scrape all the codes and get all of the tables, you could do that with the following. First, we can set up a loop to scrape all of the links. By investigating the source, you would find, as you did, that all of the product links have "cod_produto" in their href attribute. You can use an XPath selector to keep only the a tags containing that string. You're basically looping over every page of results until you get to one that doesn't have any links. This gives you 5203 links.

library(glue)
library(rvest)

all_links <- NULL
links <- "init"   # dummy value so the while condition is TRUE on the first pass
i <- 1
while(length(links) > 0){
  url <- glue("http://www.tbca.net.br/base-dados/composicao_alimentos.php?pagina={i}&atuald=3")
  h <- read_html(url)
  # keep only the a tags whose href contains a product code
  links <- h %>% 
    html_nodes(xpath = "//a[contains(@href,'cod_produto')]") %>% 
    html_attr("href") %>% 
    unique()
  all_links <- c(all_links, links)
  i <- i + 1
}
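As a quick sanity check (the count was 5203 when this was written, though it may drift as the database is updated):

length(unique(all_links))
# [1] 5203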

EDIT

Next, we can follow each link and pull the table out of it, storing the result in the list tabs. In answer to the question about how to get the product name into the data, there are two easy things to do. The first is to turn the table into a data frame and add a variable (I called it code) holding the product code. The second is to set the list names to the product codes. The answer below has been edited to do both.

all_links <- unique(all_links)
tabs <- vector(mode="list", length=length(all_links))
for(i in seq_along(all_links)){
  url <- glue("http://www.tbca.net.br/base-dados/{all_links[i]}")
  code <- gsub(".*=(.*)$", "\\1", url)   # everything after the "=" is the product code
  h <- read_html(url)
  tmp <- html_table(h, fill=TRUE)[[1]]
  tmp <- as.data.frame(tmp)
  tmp$code <- code         # store the product code as a variable in the data
  tabs[[i]] <- tmp
  names(tabs)[i] <- code   # and use it as the list name, too
}
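From there, if you want one long data frame and numeric values (the site uses comma decimal separators), a minimal sketch along these lines should work, assuming every product table has the same columns and that columns 3 through 7 hold the numeric values, as in the table shown above:

library(dplyr)

all_tabs <- bind_rows(tabs)    # the code column identifies each product
# convert comma-decimal strings such as "65,5" to numeric; "-" becomes NA (with a warning)
to_num <- function(x) as.numeric(gsub(",", ".", x))
all_tabs <- all_tabs %>%
  mutate(across(3:7, to_num))  # assumed: columns 3-7 are the value columns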

Upvotes: 1
