Kaitlin

Reputation: 59

Web scraping from an HTML table using rvest

I'm new to web scraping and am trying to scrape the following table:

                    <table class="dp-firmantes table table-condensed table-striped">
                        <thead>
                            <tr>
                                <th>FIRMANTE</th>
                                <th>DISTRITO</th>
                                <th>BLOQUE</th>
                            </tr>
                        </thead>
                        <tbody>

                            <tr>
                                <td>ROMERO, JUAN CARLOS</td>
                                <td>SALTA</td>
                                <td>JUSTICIALISTA 8 DE OCTUBRE</td>
                            </tr>
                            <tr>
                                <td>FIORE VIÑUALES, MARIA CRISTINA DEL VALLE</td>
                                <td>SALTA</td>
                                <td>PARES</td>
                            </tr>
                            </tbody>
                    </table>

I'm using the rvest package and my code is the following:

link <- read_html("https://www.hcdn.gob.ar/proyectos/resultados-buscador.html?")
table <- html_nodes(link, 'table.dp-firmantes table table-condensed table-striped')

But when I go to look at the 'table' object in R, I get the following error: {xml_nodeset (0)}

My instinct is that I'm not actually scraping any of the HTML content from the table, but I don't know why this is happening or how to fix it. I'm not sure whether the error is in my R code, whether I'm using the wrong CSS selector, or whether the table is generated by JavaScript rather than static HTML. Please let me know what I'm doing wrong here.

Edited: here is the link I'm using https://www.hcdn.gob.ar/proyectos/resultados-buscador.html

Edited: screenshot of the search results table

Upvotes: 0

Views: 812

Answers (1)

Nicolás Velasquez

Reputation: 5908

You could try the following code to parse the "Listado de Autores" table for those bills that have one. For instance, the bill with expediente No. 820/18 (link = http://www.senado.gov.ar/parlamentario/comisiones/verExp/820.18/S/PL) has that table, but I scraped the first 500 bills and did not find any other bill with such data.

library(tidyverse)
library(rvest)

html_object <- read_html('http://www.senado.gov.ar/parlamentario/comisiones/verExp/820.18/S/PL')

html_object %>%
  html_node(xpath = "//div[@id = 'Autores']/table") %>% # This is the XPath address that worked for me. The CSS locator you provided did not work.
  html_table() %>%
  as_tibble() %>% # Get the HTML table and store it in a tibble (as_data_frame() is deprecated)
  mutate(X1 = gsub("\\n|\\t|  ", "", X1)) # Remove the extra line breaks (\\n), tabs (\\t), and double spaces ("  ") present in the HTML table.

Results:

# A tibble: 2 x 2
  X1
  <chr>
1 Romero, Juan Carlos
2 Fiore Viñuales, María Cristina Del Valle
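
As for why the CSS locator in the question matched nothing: in a CSS selector, a space means "descendant element", so 'table.dp-firmantes table table-condensed table-striped' looks for a <table-striped> element nested several levels inside the table, not for one table carrying four classes. To match several classes on the same element, chain them with dots. The following is a minimal offline sketch of that point only, assuming the markup matches the snippet pasted in the question; the variable names (page, wrong, right, firmantes) are just illustrative, and it does not address whether the live results page serves this table in its static HTML.

```r
library(rvest)

# Inline copy of the table markup from the question, so no network access is needed.
html <- '<table class="dp-firmantes table table-condensed table-striped">
  <thead>
    <tr><th>FIRMANTE</th><th>DISTRITO</th><th>BLOQUE</th></tr>
  </thead>
  <tbody>
    <tr><td>ROMERO, JUAN CARLOS</td><td>SALTA</td><td>JUSTICIALISTA 8 DE OCTUBRE</td></tr>
    <tr><td>FIORE VIÑUALES, MARIA CRISTINA DEL VALLE</td><td>SALTA</td><td>PARES</td></tr>
  </tbody>
</table>'
page <- read_html(html)

# Selector from the question: the spaces are descendant combinators,
# so this searches for nested <table>, <table-condensed>, <table-striped> elements.
wrong <- html_nodes(page, "table.dp-firmantes table table-condensed table-striped")
length(wrong)  # 0, i.e. an empty nodeset

# Chaining the classes with dots matches a single element that has all of them.
# Using just "table.dp-firmantes" would also be enough here.
right <- html_nodes(page, "table.dp-firmantes.table.table-condensed.table-striped")
firmantes <- html_table(right[[1]])
firmantes
```

If the corrected selector still returns an empty nodeset on the real page, the table is probably not present in the HTML that read_html() receives, which would be worth checking separately.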

Edited: Screenshot of R's HTML capture through read_html('https://www.hcdn.gob.ar/proyectos/resultados-buscador.html?pagina=2')


Upvotes: 1
