AndyC

Reputation: 83

How to extract text from HTML using R

I need to extract the following block of text from a set of Google results obtained using

require(XML)
require(RCurl)
input <- "R%statistical%Software"
url <- paste("https://www.google.com/search?q=\"",
             input, "\"", sep = "")

CAINFO <- paste(system.file(package = "RCurl"), "/CurlSSL/ca-bundle.crt", sep = "")
script <- getURL(url, followlocation = TRUE, cainfo = CAINFO)
doc <- htmlParse(script)

in the R package XML

An extract of the retrieved HTML document is as follows:

</ul></div>
</div>
</div>
<span class="st">R, also called GNU S, is a strongly functional language and environment to <br>
statistically explore data sets, make many graphical displays of data from custom<br>
 <b>...</b></span><br>
</div>
<table class="slk" cellpadding="0" cellspacing="0" style="border-collapse:collapse;margin-top:1px">
<tr class="mslg">
<td style="padding-left:23px;vertical-align:top"><div class="sld">

In this example I need to extract the following text for each result returned

"R, also called GNU S, is a strongly functional language and environment to
statistically explore data sets, make many graphical displays of data from custom
"

I have had a go with some of the functions in the XML package for R, but I don't think I understand enough about HTML and XML. The text will vary for each result returned, so it's actually the

<span class="st">

?field? I need to extract. As you have probably guessed, I am not familiar with HTML or XML, so any recommendations for a good tutorial or book that would give me enough of an overview to solve these kinds of problems would be most welcome. Thanks.

Upvotes: 3

Views: 5369

Answers (1)

jlhoward

Reputation: 59355

This returns a list, result, containing the text of every span tag with class="st" (there are 7 in your document).

library(XML)
input <- "R%statistical%Software"
url <- paste0("http://www.google.com/search?q=", input)
doc <- htmlParse(url)   # htmlParse fetches the http URL itself
result <- lapply(doc['//span[@class="st"]'], xmlValue)   # XPath: every span with class "st"
result[1]
# [[1]]
# [1] "R, also called GNU S, is a strongly functional language and environment to \nstatistically explore data sets, make many graphical displays of data from custom\n ..."

Note that using http instead of https greatly simplifies retrieval of the document: htmlParse can read an http URL directly, so RCurl and the certificate bundle are not needed.
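To see what the XPath expression does without depending on a live search page, here is a minimal offline sketch. It assumes the XML package is installed and inlines the HTML fragment quoted in the question (wrapped in a div of my own so it parses as one piece). The variable names html and snippets are just for illustration, and the class name "st" is taken from the question, so it may not match what Google serves today.

```r
library(XML)

# Fragment of the results page quoted in the question, inlined as a string
# so the example runs without any network access.
html <- '<div>
<span class="st">R, also called GNU S, is a strongly functional language and environment to <br>
statistically explore data sets, make many graphical displays of data from custom<br>
 <b>...</b></span><br>
</div>'

# asText = TRUE tells htmlParse to treat the argument as markup, not a file or URL.
doc <- htmlParse(html, asText = TRUE)

# xpathSApply applies xmlValue to every matching node and simplifies the
# result to a character vector (one element per <span class="st">).
snippets <- xpathSApply(doc, '//span[@class="st"]', xmlValue)

# Collapse the line breaks and repeated spaces left behind by the <br> tags.
snippets <- gsub("\\s+", " ", snippets)
snippets <- trimws(snippets)

print(snippets)
```

xpathSApply is used here instead of lapply only so the result comes back as a plain character vector rather than a list; the XPath expression is the same as above.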

Upvotes: 4
