Reputation: 43
I am trying to scrape data from the web, specifically from a table that has different filters and spans several pages, and I have the following code:
library(rvest)
url.colombia.compra <- "https://colombiacompra.gov.co/tienda-virtual-del-estado-colombiano/ordenes-compra?&number_order=&state=&entity=&tool=IAD%20Software%20I%20-%20Microsoft&date_to = & date_from = "
tmp <- read_html(url.colombia.compra)
tmp_2 <- html_nodes(tmp, ".active")
The problem is that the code returns a list, but I need the data formatted as a table, and I have not managed to do that. Besides, it only gives me the data from the first page of the table. How can I extend the code so that it reads the data from all the pages and formats it as a table?
This is what the table that shows the data looks like:
Upvotes: 0
Views: 42
Reputation: 504
I would split this problem into two parts. The first is how to programmatically access each of the 11 pages of this online table.
Since this is a simple HTML table, clicking the "Next" button (siguiente) takes us to a new page. If we look at the URL of that next page, we can see the page number in the query parameters.
...tienda-virtual-del-estado-colombiano/ordenes-compra?page=1&number_order=&state...
We know that the pages are numbered starting from 0 (because "Next" takes us from the first page to page=1), and using the navigation bar we can see that there are 11 pages.
We can use the query parameters to construct a series of rvest::read_html() calls, one per page number, by using lapply and paste0 to fill in the page= value. This will let us access all the pages of the table.
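As a rough sketch of just the URL-building step, using base R only: the helper name build_page_url and the variables base_url, fixed_query and n_pages are made up for illustration, and the page count of 11 is simply the number read off the navigation bar, so adjust it if the table changes.

```r
# Base URL and the fixed part of the query string, copied from the filtered table's address.
base_url <- "https://colombiacompra.gov.co/tienda-virtual-del-estado-colombiano/ordenes-compra"
fixed_query <- "&number_order=&state=&entity=&tool=IAD%20Software%20I%20-%20Microsoft&date_to_=%20&date_from_="

# Assumption: 11 pages, as counted in the navigation bar.
n_pages <- 11

# Hypothetical helper: returns the URL for one zero-based page number.
build_page_url <- function(page) {
  paste0(base_url, "?page=", page, fixed_query)
}

# Pages are numbered from 0, so shift 1..n_pages down by one.
page_urls <- vapply(seq_len(n_pages) - 1, build_page_url, character(1))

length(page_urls)
page_urls[2]
```

Each element of page_urls can then be passed to read_html.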
The second part is making use of rvest::html_table, which parses the tables found in the result of read_html (in recent rvest versions each one comes back as a tibble).
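To see what html_table gives back without touching the network, here is a small sketch that parses an inline HTML string. The table contents and column names are invented for illustration and are not taken from the site.

```r
library(rvest)

# A made-up two-row table standing in for one page of results.
html <- "<html><body><table>
  <tr><th>Orden</th><th>Entidad</th></tr>
  <tr><td>101</td><td>Entidad A</td></tr>
  <tr><td>102</td><td>Entidad B</td></tr>
</table></body></html>"

# read_html accepts a literal HTML string as well as a URL.
doc <- read_html(html)

# Called on a whole document, html_table returns a list with one entry per <table>.
tables <- html_table(doc)

# Take the first (here, only) table and convert it to a plain data frame.
first_table <- as.data.frame(tables[[1]])
first_table
```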
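library(rvest)

# Pages are numbered from 0, so the 11 pages are page=0 through page=10.
pages <-
  lapply(0:10, function(x) {
    # Build the URL for page x, download it, and parse its table into a data frame.
    data.frame(
      html_table(
        read_html(x = paste0("https://colombiacompra.gov.co/tienda-virtual-del-estado-colombiano/ordenes-compra?page=",
                             x,
                             "&number_order=&state=&entity=&tool=IAD%20Software%20I%20-%20Microsoft&date_to_=%20&date_from_="))
      )
    )
  })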
The result is a list of data frames, which we can combine into a single table with do.call.
do.call(rbind, pages)
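For anyone unfamiliar with the do.call(rbind, ...) idiom, here is a minimal sketch on two made-up data frames standing in for scraped pages (the names page_a, page_b and toy_pages and all of the values are invented). It relies on every data frame having the same column names, which should hold when each page shows the same table.

```r
# Two made-up data frames standing in for two scraped pages.
page_a <- data.frame(orden = c(101, 102), entidad = c("A", "B"))
page_b <- data.frame(orden = 103, entidad = "C")
toy_pages <- list(page_a, page_b)

# do.call passes the list elements as separate arguments,
# so this is equivalent to rbind(page_a, page_b).
combined <- do.call(rbind, toy_pages)

nrow(combined)
combined
```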
Upvotes: 1