Janjua

Reputation: 237

Parse HTML into text with Div level in R

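library(XML)
library(xml2)  # read_html() is provided by xml2, not XML
html <- read_html("https://www.sec.gov/Archives/edgar/data/1011290/000114036105007405/body.htm")
# Re-parse the serialized page with XML so that xpathApply() works on it
doc.html <- htmlTreeParse(as.character(html), asText = TRUE, useInternalNodes = TRUE)
# '//div' matches every div, including those nested inside other divs
doc.text <- unlist(xpathApply(doc.html, '//div', xmlValue))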

The code above reads the text twice because of the nested div structure; I need each piece of text only once. Thank you for your time and help. For example:

doc.text[2] # contains all the text which repeats again in 3 to 59

Upvotes: 0

Views: 623

Answers (1)

Nicolás Velasquez

Reputation: 5908

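Try selecting only the div elements that are direct children of the text element, so that nested divs are not matched a second time: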

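library(rvest)  # also provides the %>% pipe
html <- read_html("https://www.sec.gov/Archives/edgar/data/1011290/000114036105007405/body.htm")
text <- html %>%
  html_nodes(xpath = "//text/div") %>%  # divs directly under <text> only
  html_text(trim = TRUE) %>%            # text of each div, nested content included
  paste(collapse = ' ')                 # join everything into a single string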
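
To see why `//div` repeats text, here is a minimal sketch on a made-up page (the HTML string and the variable names are assumptions for illustration, not the SEC document). It uses `body` as the parent element where the real page is assumed to use `text`. The outer div already contains the text of its inner divs, so matching every div returns that text more than once.

```r
library(xml2)

# Hypothetical miniature page: one plain div, and one div wrapping two inner divs.
page <- read_html(
  "<html><body><div>Intro</div><div>Outer <div>inner A</div><div>inner B</div></div></body></html>"
)

# '//div' matches all four divs: the text of the inner divs shows up once
# inside the outer div and once more on its own.
all_divs <- xml_text(xml_find_all(page, "//div"))

# '//body/div' matches only the two divs directly under <body>,
# so each piece of text is returned once.
top_divs <- xml_text(xml_find_all(page, "//body/div"))

print(length(all_divs))
print(length(top_divs))
```

If the real page nests its content differently, the parent in the XPath would need to be adjusted to whatever element directly contains the top-level divs.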

Upvotes: 1
