shanky_thebearer

Reputation: 219

Reading only the relevant text from an HTML page using R

Is there a way to access only the textual content of a Wikipedia page using R? I am looking for something equivalent to jsoup, as shown in this Stack Overflow post: "Extraction of text using: Jsoup".

Thanks.

Upvotes: 0

Views: 561

Answers (1)

RmIu

Reputation: 4487

From here:

# load packages
library(RCurl)
library(XML)

# download html
html <- getURL("https://en.wikipedia.org/wiki/Main_Page", followlocation = TRUE)

# parse html
doc <- htmlParse(html, asText = TRUE)

# extract the text of every <p> element and print one paragraph per line
plain.text <- xpathSApply(doc, "//p", xmlValue)
cat(paste(plain.text, collapse = "\n"))
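
The `//p` query above returns every paragraph on the page, which can include navigation or sidebar text. Here is a minimal sketch of narrowing it to the article body and dropping empty paragraphs. It assumes the content sits in a container with the id `mw-content-text` (the id MediaWiki pages generally use, but worth confirming in the page source), and it parses a small made-up HTML string rather than a downloaded page so that it runs offline.

```r
library(XML)

# Made-up markup standing in for a downloaded article (hypothetical example)
html <- paste0(
  "<html><body>",
  "<div id='nav'><p>Navigation text</p></div>",
  "<div id='mw-content-text'>",
  "<p>First paragraph.</p>",
  "<p>   </p>",
  "<p>Second <b>paragraph</b>.</p>",
  "</div>",
  "</body></html>"
)

doc <- htmlParse(html, asText = TRUE)

# Only paragraphs inside the assumed content container are selected;
# unlist() flattens the result in case any node yields no value
paragraphs <- unlist(
  xpathSApply(doc, "//div[@id='mw-content-text']//p", xmlValue)
)

# Trim surrounding whitespace and drop paragraphs that end up empty
paragraphs <- trimws(paragraphs)
paragraphs <- paragraphs[nzchar(paragraphs)]

cat(paste(paragraphs, collapse = "\n"))
```

For a real page, `html` would come from `getURL()` as in the code above, and the XPath expression may need adjusting to whatever container the target site uses.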

Upvotes: 2
