Reputation: 59
I am trying to scrape the content of the following div tag:
<div style="font-weight:normal">
<h3> PROYECTO DE LEY </h3> <br>
<strong>Expediente </strong>4893-D-2007<br>
<strong>Sumario: </strong>LEY DE EDUCACION SUPERIOR: PRINCIPIOS
GENERALES, ESTRUCTURA Y ARTICULACION, DE LOS INSTITUTOS DE EDUCACION
SUPERIOR, DE LOS TITULOS Y PLANES DE ESTUDIO, ORGANOS DE GOBIERNO,
EDUCACION SUPERIOR A DISTANCIA, DEROGACION DE LA LEY 24521.<br>
<strong>Fecha: </strong><br>
</div>
using rvest in R. I have the following code so far:
link <- read_html(paste0("https://www.hcdn.gob.ar/proyectos/proyectoTP.jsp?exp=4893-D-2007"))
type <- html_nodes(link, 'h3')
type_text <- html_text(type)
table <- html_node(link, "table.table.table-bordered tbody")
table_text <- html_text(table)
table_text <- gsub("\n", "", table_text)
table_text <- gsub("\t", "", table_text)
table_text <- gsub("", "", table_text)
# this is the relevant part of the code that attempts to capture the style css selector
billsum <- html_node(link, style*='font-weight:normal')
billsum_text <- html_text(billsum)
I'm not really sure what's happening with the code or if there's a better way to scrape this information, but I'd really like to be able to scrape the sumario and fecha content.
Upvotes: 0
Views: 284
Reputation: 13
Pass a CSS selector to html_node() to pick out the element you need; in your case the selector is ".interno div". I recommend the SelectorGadget extension for Google Chrome: you click the parts of the page you want, click the parts you want to exclude, and it shows you a selector that matches your choice.
link <- read_html(paste0("https://www.hcdn.gob.ar/proyectos/proyectoTP.jsp?exp=4893-D-2007"))
billsum <- html_node(link, ".interno div")
billsum_text <- html_text(billsum)
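If you would rather keep matching on the style attribute, as the question tried to, the selector has to be a quoted string in CSS attribute-selector form. The following is only a minimal sketch: it parses an inline copy of the div from the question (plus one unrelated div added for illustration) instead of the live page, so the structure of the real page is an assumption here.

```r
library(rvest)

# Inline copy of the <div> from the question, so the sketch runs offline.
# The first <div class="other"> is a hypothetical extra element, added only
# to show that the selector skips it.
html <- '<html><body>
<div class="other">unrelated</div>
<div style="font-weight:normal"><h3> PROYECTO DE LEY </h3><br>
<strong>Expediente </strong>4893-D-2007<br>
<strong>Sumario: </strong>LEY DE EDUCACION SUPERIOR<br>
<strong>Fecha: </strong><br>
</div>
</body></html>'

doc <- read_html(html)

# [style*='...'] matches any div whose style attribute contains the substring.
billsum <- html_node(doc, "div[style*='font-weight:normal']")

# The bold labels inside the matched div.
labels <- trimws(html_text(html_nodes(billsum, "strong")))
heading <- trimws(html_text(html_node(billsum, "h3")))
```

The substring match depends on the exact spelling of the inline style (for example, no space after the colon), so a class-based selector such as ".interno div" may hold up better if the site changes its markup.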
Upvotes: 0
Reputation: 389105
To get the "Sumario" content, you can do:
library(rvest)
url <- "https://www.hcdn.gob.ar/proyectos/proyectoTP.jsp?exp=4893-D-2007"
url %>%
  read_html() %>%
  html_text() %>%
  gsub("\t|\n", "", .) %>%
  sub(".*Sumario:(.*)\\.Fecha:.*", "\\1", .)
#[1] " LEY DE EDUCACION SUPERIOR: PRINCIPIOS GENERALES, ESTRUCTURA Y ARTICULACION,
# DE LOS INSTITUTOS DE EDUCACION SUPERIOR, DE LOS TITULOS Y PLANES DE ESTUDIO,
# ORGANOS DE GOBIERNO, EDUCACION SUPERIOR A DISTANCIA, DEROGACION DE LA LEY 24521"
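The same regular-expression idea can be limited to the div itself and extended to the "Fecha" field. This is a rough sketch under a few assumptions: it works on an inline copy of the div from the question rather than the live page, it assumes the labels appear in the order Sumario then Fecha, and the variable names are only illustrative.

```r
library(rvest)

# Inline copy of the <div> from the question, so the sketch runs offline.
html <- '<div style="font-weight:normal"><h3> PROYECTO DE LEY </h3><br>
<strong>Expediente </strong>4893-D-2007<br>
<strong>Sumario: </strong>LEY DE EDUCACION SUPERIOR: PRINCIPIOS
GENERALES, ESTRUCTURA Y ARTICULACION, DE LOS INSTITUTOS DE EDUCACION
SUPERIOR, DE LOS TITULOS Y PLANES DE ESTUDIO, ORGANOS DE GOBIERNO,
EDUCACION SUPERIOR A DISTANCIA, DEROGACION DE LA LEY 24521.<br>
<strong>Fecha: </strong><br>
</div>'

doc <- read_html(html)

# Take the text of the div only, then collapse all whitespace runs.
txt <- html_text(html_node(doc, "div[style*='font-weight:normal']"))
txt <- trimws(gsub("\\s+", " ", txt))

# Everything between "Sumario:" and "Fecha:" (lazy match needs perl = TRUE).
sumario <- sub(".*Sumario:\\s*(.*?)\\s*Fecha:.*", "\\1", txt, perl = TRUE)

# Everything after "Fecha:"; an empty string when the page gives no date.
fecha <- sub(".*Fecha:\\s*", "", txt)
```

In the snippet from the question the "Fecha" field is empty, so fecha comes back as an empty string here. Note that sub() returns its input unchanged when the pattern does not match, so it is worth checking that the labels are present before trusting the result on other pages.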
Upvotes: 1