Reading only the relevant text from an HTML page using R

Question

Is there a way to access only the textual content of a Wikipedia page using R? I am looking for something equivalent to jsoup, as shown in this Stack Overflow post: "Extraction of text using Jsoup".

Thanks.

RmIu · Accepted Answer

From here:

# load packages
library(RCurl)
library(XML)

# download html
html <- getURL("https://en.wikipedia.org/wiki/Main_Page", followlocation = TRUE)

# parse html
doc <- htmlParse(html, asText = TRUE)

# extract the text of every <p> element and print it, one paragraph per line
plain.text <- xpathSApply(doc, "//p", xmlValue)
cat(paste(plain.text, collapse = "\n"))
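
If "//p" also picks up paragraphs from navigation or sidebars, one option is to narrow the XPath to the container that holds the article body and tidy the result. Below is a minimal sketch of that idea, assuming the XML package is installed. The inline page and the id "content" are made up for illustration, so inspect the real page to find the right container.

```r
library(XML)

# Hypothetical stand-in for a downloaded page (not real Wikipedia markup)
html <- "<html><body>
<div id='nav'><p>Menu</p></div>
<div id='content'>
  <p>  First paragraph.  </p>
  <p>Second <b>paragraph</b>.</p>
</div>
</body></html>"

doc <- htmlParse(html, asText = TRUE)

# Keep only <p> elements inside the (assumed) content container
paras <- xpathSApply(doc, "//div[@id='content']//p", xmlValue)

# Trim surrounding whitespace and drop paragraphs that end up empty
paras <- trimws(paras)
paras <- paras[nzchar(paras)]

cat(paste(paras, collapse = "\n"))
```

xmlValue returns the text of a node together with all of its descendants, so inline markup such as the <b> tag above is flattened into plain text.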
