Reputation: 333
I have just begun using the XML package in R, and I'm having trouble extracting a string from an XML node list:
> library("XML")
> library("stringr")
> url = "html-1.html"
> parsed_doc = htmlParse(file=url, useInternalNodes = TRUE)
> products <- getNodeSet(doc = parsed_doc, path = "//li[contains(., 'Product ID')]")
> products
[[1]]
<li>
Product ID:
000002434482
</li>
[[2]]
<li>
Product ID:
000002183105
</li>
[[3]]
<li>
Product ID:
000002183105
</li>
I would like to create a vector containing each ID. I have tried a number of regular expressions to extract the 12-digit product ID, but I can't get any of them to work.
> mrn <- str_extract(products , "[[:digit:]{12}")
> mrn <- str_extract(products , "[[:digit:]+
]")
> mrn <- str_extract(products , "[0-9]+
")
I wondered if the list structure had something to do with it or maybe the spacing?
I have also tried
> mrn <- str_extract(products , ".{16}")
However, R returns pointer values such as "<pointer: 0x6815". I think this is close, but I'm not sure what it means.
Upvotes: 1
Views: 1449
Reputation: 627022
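You are almost there. The node set is not a character vector, so you need to get the string values out of it first. That is also why you see "<pointer: ..." in your results: each list element is a reference to an internal node, not text. You can extract the values with xmlValue, and then use str_extract (or str_extract_all):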
> v <- sapply(products, xmlValue)
> v
[1] "\r\n Product ID:\r\n 000002434482\r\n"
[2] "\r\n Product ID:\r\n 000002183105\r\n"
[3] "\r\n Product ID:\r\n 000002183105\r\n "
> unlist(str_extract_all(v, "[[:digit:]]+"))
[1] "000002434482" "000002183105" "000002183105"
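To try the approach above without the original file, here is a minimal sketch that parses an inline HTML string instead of html-1.html. The sample markup and the names html, doc, and ids are made up for illustration; xpathSApply is used here as a shortcut for getNodeSet followed by sapply.

```r
library(XML)
library(stringr)

# Hypothetical sample markup standing in for html-1.html
html <- "<html><body><ul>
  <li>Product ID:
      000002434482</li>
  <li>Product ID:
      000002183105</li>
  <li>Unrelated item</li>
</ul></body></html>"

doc <- htmlParse(html, asText = TRUE)

# Apply xmlValue to every matching node, which yields a character vector
v <- xpathSApply(doc, "//li[contains(., 'Product ID')]", xmlValue)

# One ID per node, so str_extract (first match only) is enough
ids <- str_extract(v, "[[:digit:]]+")
print(ids)
```

Because str_extract returns one value per input string, the result is already a plain character vector and needs no unlist.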
If the IDs are whole words consisting of exactly 12 digits, you can use a more precise expression such as
"\\b[[:digit:]]{12}\\b"
where \b is a word boundary and {12} is a limiting quantifier that matches exactly 12 occurrences of a digit.
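The difference between the loose and the precise pattern only shows up when the text contains other digit runs. The following sketch compares the two on a few invented strings (the vector x is hypothetical and not taken from the question's data).

```r
library(stringr)

# Invented inputs: a clean ID, an ID preceded by another number,
# and a 13-digit run that should not count as a product ID
x <- c("Product ID: 000002434482",
       "Order 12345 Product ID: 000002183105",
       "ID: 1234567890123")

# First run of digits, whatever its length
loose <- str_extract(x, "[[:digit:]]+")

# Exactly 12 digits delimited by word boundaries; NA when there is none
strict <- str_extract(x, "\\b[[:digit:]]{12}\\b")

print(loose)
print(strict)
```

With the loose pattern, the second string yields the order number and the third yields the 13-digit run; the strict pattern picks the 12-digit ID in the first two strings and returns NA for the third.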
Alternatively, you can extract these IDs with str_match and the regex Product ID:\s*(\d{12})\b, which matches Product ID:, then optional whitespace, then the 12-digit number as a whole word. The number is captured in a group, which is why str_match is needed rather than str_extract:
> res <- str_match(v, "Product ID:\\s*(\\d{12})\\b")
> res[,2]
[1] "000002434482" "000002183105" "000002183105"
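It may help to look at the shape of what str_match returns. The sketch below reuses two of the strings from the xmlValue output shown earlier; the names m and ids are only illustrative.

```r
library(stringr)

# Strings copied from the xmlValue output shown earlier
v <- c("\r\n Product ID:\r\n 000002434482\r\n",
       "\r\n Product ID:\r\n 000002183105\r\n")

# str_match returns a character matrix with one row per input string:
# column 1 holds the whole match, column 2 the first capture group
m <- str_match(v, "Product ID:\\s*(\\d{12})\\b")
print(dim(m))

# Keep only the captured 12-digit IDs
ids <- m[, 2]
print(ids)
```

Since the result is already a matrix, indexing the second column is all that is needed to get the vector of IDs.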
Upvotes: 1