Reputation: 219
Is there a way to access only the textual content of a Wikipedia page using R? I'm looking for something equivalent to jsoup, as shown in this Stack Overflow post: Extraction of text using Jsoup
Thanks.
Upvotes: 0
Views: 561
Reputation: 4487
From here:
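# load packages
library(RCurl)
library(XML)
# download html (followlocation = TRUE follows any redirects)
html <- getURL("https://en.wikipedia.org/wiki/Main_Page", followlocation = TRUE)
# parse html from the downloaded string
doc <- htmlParse(html, asText = TRUE)
# extract the text of every <p> element, dropping the markup
plain.text <- xpathSApply(doc, "//p", xmlValue)
# print the paragraphs, one per line
cat(paste(plain.text, collapse = "\n"))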
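To see what the "//p" XPath step keeps and drops without hitting the network, here is a minimal sketch that runs the same parsing calls on a small inline HTML string. The HTML snippet and the variable name `paragraphs` are made up for illustration; only `htmlParse`, `xpathSApply` and `xmlValue` come from the code above.

```r
library(XML)

# Hypothetical stand-in for a downloaded page (not real Wikipedia markup)
html <- paste0(
  "<html><body>",
  "<h1>Title</h1>",
  "<p>First <b>paragraph</b>.</p>",
  "<div>Navigation menu</div>",
  "<p>Second paragraph.</p>",
  "</body></html>"
)

# Parse the string as HTML
doc <- htmlParse(html, asText = TRUE)

# Select every <p> node and take its text content;
# nested tags such as <b> are stripped, their text is kept
paragraphs <- xpathSApply(doc, "//p", xmlValue)

print(paragraphs)
```

Only the text inside the two p elements should come back; the heading and the div are skipped. If you also need headings or list items, you would probably have to widen the XPath expression, for example "//p | //h1 | //li".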
Upvotes: 2