Extracting data from html pages using regex

Question

I feel like I'm very close to a solution here but can't seem to figure out why I'm not getting any result. I have an html page and I'm trying to parse out some IDs from it. I'm 99% certain my regex code is right, but for some reason I'm not getting any output.

In the html source, there are many ids that are wrapped with text like: /boardgame/9999/asdf. My regex code should pull out the /9999/ bit, but I can't figure out why it's just returning the same input html character string that I put in.

library(RCurl)
library(XML)
url <- sprintf("https://boardgamegeek.com/browse/boardgame/page/1")
html <- getURL(url, followlocation = TRUE)
id.parse <- sub("boardgame(.*?)[a-z]", "\1", html)

Any thoughts?

Damiano Fantini · Accepted Answer

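I think your pattern was too loose: "boardgame(.*?)[a-z]" also matches other words, such as "boardgames". On top of that, sub() only replaces the first match and leaves the rest of the string untouched, and in an R string the backreference has to be written as "\\1" ("\1" is read as a control character). That is why you got back essentially the whole page. The following should work for a single ID.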

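# Locate the first "boardgame/<digits>/<letter>" occurrence
id.pos <- regexpr("boardgame/[[:digit:]]{3,10}/[a-z]", html)
# Extract exactly the matched text (end position = start + length - 1)
my.id <- substr(html, id.pos, id.pos + attr(id.pos, "match.length") - 1)
# Drop the leading "boardgame/" and the trailing "/<letter>", keeping the digits
gsub("(^[[:alpha:]]*/)|(/[[:alpha:]]*$)", "", my.id)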

When I run it, it returns:

[1] "226501"
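
To see why the original sub() call gave back the page instead of the ID, here is a minimal sketch on a toy string. The value of sample.html is a made-up, single-line stand-in for the real page source, and res.orig / res.fixed are names introduced only for this illustration.

```r
# Hypothetical one-line stand-in for the page source
sample.html <- '<a href="/boardgame/9999/asdf">asdf</a>'

# Original call: "\1" is the character with code 1, not a backreference,
# and sub() leaves everything outside the first match in place
res.orig <- sub("boardgame(.*?)[a-z]", "\1", sample.html)

# Escape the backslash and let the pattern cover the whole string,
# so that only the captured digits survive the replacement
res.fixed <- sub(".*?boardgame/([[:digit:]]+)/.*", "\\1", sample.html, perl = TRUE)
res.fixed
# [1] "9999"
```

This only works as written because the toy string has one link on one line. On the full page, extracting matches with regexpr() as above is the more robust route.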

Also, I found many IDs in this HTML page. To collect them all in one list, you could loop over the matches as follows.

library(RCurl)
url <- "https://boardgamegeek.com/browse/boardgame/page/1"
html <- getURL(url, followlocation = TRUE)
pattern <- "boardgame/[[:digit:]]{3,10}/[a-z]"
id.list <- list()
# Repeat until no further match is found in the remaining text
while ((id.pos <- regexpr(pattern, html)) > 0) {
  match.end <- id.pos + attr(id.pos, "match.length") - 1
  my.id <- substr(html, id.pos, match.end)
  # Keep only the digits between the slashes
  id.list[[length(id.list) + 1]] <- gsub("(^[[:alpha:]]*/)|(/[[:alpha:]]*$)", "", my.id)
  # Continue searching right after the current match
  html <- substr(html, match.end + 1, nchar(html))
}
id.list
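
As an alternative to the loop, base R can return all matches in one pass with gregexpr() and regmatches(). The sketch below runs on a made-up sample.html rather than the live page, and it uses "[[:digit:]]+" instead of "{3,10}" on the assumption that shorter IDs should be kept too. Adjust the quantifier if that is not what you want.

```r
# Hypothetical stand-in for the downloaded page source
sample.html <- paste0(
  '<a href="/boardgame/226501/game-a">A</a>\n',
  '<a href="/boardgames/browse">B</a>\n',
  '<a href="/boardgame/9999/game-b">C</a>'
)

# Find every "boardgame/<digits>/" occurrence in one pass;
# "boardgames/..." is skipped because "s" is not a slash
hits <- regmatches(sample.html, gregexpr("boardgame/[[:digit:]]+/", sample.html))[[1]]

# Remove everything that is not a digit, leaving only the IDs
ids <- gsub("[^[:digit:]]", "", hits)
ids
# [1] "226501" "9999"
```

If the same ID is linked several times on the page, wrapping the result in unique(ids) drops the repeats.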
