Alejandro Chaves

Reputation: 43

How to format data scraped from the web as a table and read the data from every page of the table

I am trying to scrape data from the web, specifically from a table that has several filters and spans multiple pages. I have the following code:

library(rvest)

url.colombia.compra <- "https://colombiacompra.gov.co/tienda-virtual-del-estado-colombiano/ordenes-compra?&number_order=&state=&entity=&tool=IAD%20Software%20I%20-%20Microsoft&date_to=&date_from="
tmp <- read_html(url.colombia.compra)
tmp_2 <- html_nodes(tmp, ".active")

The problem is that the code returns a list, but I need the data formatted as a table, and I have not managed to do that. It also only shows data from the first page of the table. How can I extend the code so that it reads the data from all the pages and formats it as a table?

This is what the table that shows the data looks like:

[screenshot of the table]

Upvotes: 0

Views: 42

Answers (1)

Kent Orr

Reputation: 504

I would split this problem into two parts. The first is how to programmatically access each of the 11 pages of this online table.

Since this is a simple HTML table, clicking the "Next" button (Siguiente) takes us to a new page. If we look at the URL of that next page, we can see the page number in the query parameters.

...tienda-virtual-del-estado-colombiano/ordenes-compra?page=1&number_order=&state...

We know that the pages are numbered starting from 0 (because "Next" takes us to page=1), and the navigation bar shows that there are 11 pages, so the page values run from 0 to 10.

We can use the query parameters to construct a series of rvest::read_html() calls, one per page number, by using lapply and paste0 to fill in the page= value. This lets us access all the pages of the table.
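As a minimal sketch of that URL construction (no requests are made here), the page numbers can be pasted into the query string up front. The names base_url, filters, page_numbers, and page_urls are only illustrative, and the range 0:10 assumes the 11 zero-based pages described above. It relies on paste0 being vectorised rather than on lapply, which gives the same set of URLs.

```r
# Fixed parts of the URL, split around the page number.
base_url <- "https://colombiacompra.gov.co/tienda-virtual-del-estado-colombiano/ordenes-compra?page="
filters  <- "&number_order=&state=&entity=&tool=IAD%20Software%20I%20-%20Microsoft&date_to_=%20&date_from_="

# Assumption: 11 pages numbered from 0, i.e. page=0 through page=10.
page_numbers <- 0:10

# paste0() is vectorised, so this yields one URL per page number.
page_urls <- paste0(base_url, page_numbers, filters)

length(page_urls)
head(page_urls, 2)
```

Printing the first couple of URLs is a cheap way to check them against the address bar before requesting anything.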

The second part is making use of rvest::html_table(), which parses the HTML tables in the result of read_html() into tibbles.
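To see what html_table returns without touching the site, here is a small sketch on an inline HTML snippet. The table contents and the column names order and entity are made up for illustration. minimal_html is an rvest helper that wraps an HTML string into a document, and the tibble output assumes a recent rvest version.

```r
library(rvest)

# A made-up two-row table standing in for one page of results.
doc <- minimal_html("
  <table>
    <tr><th>order</th><th>entity</th></tr>
    <tr><td>101</td><td>A</td></tr>
    <tr><td>102</td><td>B</td></tr>
  </table>
")

# On a whole document, html_table() returns a list with one element per <table>.
tables <- html_table(doc)
length(tables)

# Taking the first element gives the table itself.
first_table <- tables[[1]]
first_table
```

Because the return value is a list, the table has to be extracted from it (or the list converted) before it can be treated as a single data frame.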

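library(rvest)

# Pages are numbered from 0, so 11 pages means page=0 through page=10.
pages <-
  lapply(0:10, function(x) {
    # html_table() on a whole document returns a list of tables;
    # data.frame() turns it into a data frame (assuming one table per page).
    data.frame(
      html_table(
        read_html(x = paste0("https://colombiacompra.gov.co/tienda-virtual-del-estado-colombiano/ordenes-compra?page=", 
                             x, 
                             "&number_order=&state=&entity=&tool=IAD%20Software%20I%20-%20Microsoft&date_to_=%20&date_from_="))
      )
    )
  })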

The result is a list of data frames, which we can combine into a single data frame with do.call.

do.call(rbind, pages)
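The combining step can be tried on toy data first. In this sketch the two data frames stand in for scraped pages (their contents and the names pages_demo and all_orders are invented); rbind expects every element to have the same column names.

```r
# Two invented data frames standing in for two scraped pages.
pages_demo <- list(
  data.frame(order = c(101, 102), entity = c("A", "B")),
  data.frame(order = c(103, 104), entity = c("C", "D"))
)

# do.call() passes the list elements to rbind() as separate arguments,
# which is equivalent to rbind(pages_demo[[1]], pages_demo[[2]]).
all_orders <- do.call(rbind, pages_demo)

nrow(all_orders)
```

With the real list, assigning the result in the same way (for example all_orders <- do.call(rbind, pages)) keeps the combined table available for further work.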

Upvotes: 1
