Reputation: 1273
I feel like I'm very close to a solution but can't figure out why I'm not getting the result I expect. I have an HTML page and I'm trying to parse some IDs out of it. I'm 99% certain my regex is right, but I'm not getting the output I want.
In the HTML source, there are many IDs embedded in text like /boardgame/9999/asdf. My regex should pull out the /9999/ bit, but it just returns the same HTML character string that I put in.
library(RCurl)
library(XML)
url <- sprintf("https://boardgamegeek.com/browse/boardgame/page/1")
html <- getURL(url, followlocation = TRUE)
id.parse <- sub("boardgame(.*?)[a-z]", "\\1", html)
Any thoughts?
Upvotes: 0
Views: 35
Reputation: 1975
I think your pattern was not accurate: it also picks up other words that start with "boardgame", such as "boardgames". The following should work for a single ID.
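# Locate the first "boardgame/<digits>/<letter>" occurrence
id.pos <- regexpr("boardgame/[[:digit:]]{3,10}/[a-z]", html)
# The match ends at id.pos + match.length - 1
my.id <- substr(html, id.pos, id.pos + attr(id.pos, "match.length") - 1)
# Strip the leading "boardgame/" and the trailing "/<letter>"
gsub("(^[[:alpha:]]*/)|(/[[:alpha:]]*$)", "", my.id)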
For me, this returns:
[1] "226501"
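As for why the original call seemed to hand back the input: `sub()` replaces only the first match and returns the whole string, so everything outside the match is kept. A minimal sketch of that behaviour, assuming a made-up one-link snippet (`sample` is hypothetical and not taken from the page):

```r
# Hypothetical snippet standing in for the downloaded page
sample <- '<a href="/boardgame/9999/asdf">Some Game</a>'

# sub() swaps only the matched part for the replacement and keeps the rest,
# so the result is still nearly the whole input string
replaced <- sub("boardgame(.*?)[a-z]", "\\1", sample)

# Making the pattern span the whole string and keeping only the captured
# digits leaves just the ID
id <- sub(".*boardgame/([[:digit:]]+)/.*", "\\1", sample)
```

Note that on a page with many links the greedy leading `.*` would keep only the last ID, so this form is only suited to strings that contain a single link.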
Also, I found many IDs in this HTML page. To collect them all in one list, you could do the following.
url <- sprintf("https://boardgamegeek.com/browse/boardgame/page/1")
html <- getURL(url, followlocation = TRUE)
id.list <- list()
while ((id.pos <- regexpr("boardgame/[[:digit:]]{3,10}/[a-z]", html)) > 0) {
  match.len <- attr(id.pos, "match.length")
  # The match spans id.pos .. id.pos + match.len - 1
  my.id <- substr(html, id.pos, id.pos + match.len - 1)
  id.list[[length(id.list) + 1]] <- gsub("(^[[:alpha:]]*/)|(/[[:alpha:]]*$)", "", my.id)
  # Continue searching in the text after this match
  html <- substr(html, id.pos + match.len, nchar(html))
}
id.list
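The same result can probably be reached without a loop by using base R's `gregexpr()` and `regmatches()`, which return every match at once. A rough sketch, assuming a made-up snippet in place of the downloaded page (`sample` and its links are hypothetical):

```r
# Hypothetical snippet standing in for the downloaded page
sample <- paste0(
  '<a href="/boardgame/226501/some-game"><img src="a.jpg"></a>',
  '<a href="/boardgame/226501/some-game">Some Game</a>',
  '<a href="/boardgame/9999/asdf">Asdf</a>'
)

# Find every "boardgame/<digits>/<letter>" occurrence in one pass
hits <- regmatches(sample, gregexpr("boardgame/[[:digit:]]{3,10}/[a-z]", sample))[[1]]

# Keep only the digits, then drop IDs that appear more than once
ids <- unique(gsub("[^[:digit:]]", "", hits))
ids
```

The `unique()` call only matters if the page links to the same game more than once; drop it if repeated IDs should be kept.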
Upvotes: 1