Btibert3

Reputation: 40146

Regex pattern match in a character

I am new to R so I apologize if this is easy and straight forward. I have successfully read a web page into a character vector. I want to strip this string down to a smaller segment so I can extract some data. So far, so easy.

The problem is that I am new to regex and R, so this has been pretty hard for me. I simply want to shorten the string such that it includes everything between the

<div class="appForm"

and 

</div>

For some reason, I am having a hard time using the stringr package and ?str_match.

Any help, or a more efficient solution, will be very much appreciated. I'm a newbie at web scraping, but determined to stay within R.
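For context, here is a minimal sketch of the kind of extraction I'm after, using a made-up one-line snippet in place of the real page (the lazy `.*?` match and the dotall flag are the parts I wasn't sure about):

```r
library(stringr)

# Hypothetical snippet standing in for the page I read in
page <- '<html><body><div class="appForm"><p>target data</p></div><footer/></body></html>'

# Lazy match from the opening div to the first closing </div>;
# regex(dotall = TRUE) lets . cross newlines in a real multi-line page
m <- str_match(page, regex('(<div class="appForm".*?</div>)', dotall = TRUE))
m[1, 2]
# [1] "<div class=\"appForm\"><p>target data</p></div>"
```

Note this only works when the appForm div contains no nested divs, since the lazy match stops at the first `</div>`.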

Upvotes: 1

Views: 654

Answers (3)

Vince

Reputation: 7638

I suggest using the XML package and XPath. This requires some learning, but if you're serious about web scraping, it's the way to go. I did this with some county-level elections data from the NY Times website ages ago, and the code looked something like this (just to give you an idea):

library(XML)

getCounty <- function(url) {
    doc <- htmlTreeParse(url, useInternalNodes = TRUE)

    # Grab the text of every county-name table cell via XPath
    nodes <- getNodeSet(doc, "//tr/td[@class='county-name']/text()")
    tmp <- sapply(nodes, xmlValue)

    # clean() was a helper of mine for tidying the scraped strings
    county <- sapply(tmp, function(x) clean(x, num = FALSE))

    return(county)
}

You can learn about XPath here.

Another example: grab all R package names from the Crantastic timeline. This looks for the div node with the id "timeline", then the ul with the class "timeline" inside it, extracts the first a node from each list item, and returns their text:

url <- 'http://crantastic.org/'
doc <- htmlTreeParse(url, useInternalNodes = TRUE)

nodes <- getNodeSet(doc, "//div[@id='timeline']/ul[@class='timeline']/li/a[1]/text()")
tmp <- sapply(nodes, xmlValue)
tmp

 [1] "landis"          "vegan"           "mutossGUI"       "lordif"         
 [5] "futile.paradigm" "lme4"            "tm"              "qpcR"           
 [9] "igraph"          "aspace"          "ade4"            "MCMCglmm"       
[13] "hts"             "emdbook"         "DCGL"            "wq"             
[17] "crantastic"      "Psychometrics"   "crantastic"      "gR"             
[21] "crantastic"      "Distributions"   "rAverage"        "spikeslab"      
[25] "sem"

Upvotes: 3

Richie Cotton

Reputation: 121077

I second Stephen and Vince's advice to use htmlTreeParse in the XML package. There are quite a few SO questions related to scraping/using HTML content in R based on this idea. Take a look at

Scraping html tables into R data frames using the XML package

How can I use R (Rcurl/XML packages ?!) to scrape this webpage ?

How to isolate a single element from a scraped web page in R

How to transform XML data into a data.frame?

Upvotes: 2

hatmatrix

Reputation: 44892

Some in the community heavily discourage the use of regular expressions to parse markup that can contain arbitrarily nested elements; nested div tags, as here, are exactly that case. R does have an XML parser (also applicable to HTML) which you might consider using for this purpose.
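To illustrate the nesting problem: with an inner div nested inside the appForm div, a lazy regex would stop at the first closing tag, while the XML package's parser tracks nesting and returns the full element. A minimal sketch, where the one-line HTML string is a made-up stand-in for the asker's page:

```r
library(XML)

html <- '<div class="appForm"><div class="inner">data</div><p>more</p></div>'
doc  <- htmlParse(html, asText = TRUE)

# The parser tracks nesting, so the whole appForm div comes back intact,
# inner div and all
node <- getNodeSet(doc, "//div[@class='appForm']")[[1]]
xmlValue(node)
# [1] "datamore"
```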

Upvotes: 5
