Reputation: 25
I'm having trouble thinking of an efficient way to remove parts of a string in R. I'm reading text data into R. The data is HTML and looks roughly like this:
dummy <- c("Blah Blah 10pt margins blah blah 11pt blah format 23pt real answer34")
I'm trying to isolate only that "34", but I can't simply pull out every number because of the "10pt", "11pt", and "23pt" HTML formatting.
What I would like to do is, for every instance of the text "pt", remove the two characters in front of it. If I do that, I get:
newDummy <- c("Blah Blah pt margins blah blah pt blah format pt real answer34")
Then I can get my answer of 34 via str_extract_all(newDummy,"\\(?[0-9,.]+\\)?")
from the stringr library.
The problem is that I can't seem to efficiently turn "dummy" into "newDummy" -- does anyone have a neat solution?
Thanks!
Upvotes: 0
Views: 180
Reputation: 887901
You could either use:
dummy <- c("Blah Blah 10pt margins blah blah 11pt blah format 23pt real answer34")
library(stringi)
stri_extract_all_regex(dummy,'\\d+?\\d(?!pt)')[[1]]
#[1] "34"
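or, if the number is separated from the preceding text by a space (as in "real answer 34"), match on word boundaries:
library(stringr)
str_extract_all("10pt 11pt real answer 34", "\\b\\d+\\b")[[1]]
#[1] "34"
Note that in "answer34" there is no word boundary between "r" and "3", so this pattern finds nothing there.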
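One caveat on the lookahead pattern, sketched in base R so it runs without extra packages: the `(?!pt)` check only inspects the two characters right after the match, so a font size with three or more digits can leak a partial match. The input string and the helper `extract_perl` below are made up for illustration, and the second pattern is just one possible variant, not the only fix.

```r
# Hypothetical input with a three-digit and a one-digit font size.
x <- "100pt header 8pt footer real answer34"

# Hypothetical helper: all matches of a PCRE pattern in a single string.
extract_perl <- function(pattern, text) {
  regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
}

# Pattern from the answer: "10" is followed by "0p", not "pt", so it slips through.
original <- extract_perl("\\d+?\\d(?!pt)", x)
original
# [1] "10" "34"

# Variant: take a whole run of digits (no digit before it, no digit after it)
# and reject the run if "pt" follows.
robust <- extract_perl("(?<!\\d)\\d+(?!\\d|pt)", x)
robust
# [1] "34"
```

The variant also picks up single-digit answers, which the two-digit minimum of the original pattern would miss.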
If the number is attached directly to the text, the lookahead version still works:
dummy <- "10pt 11pt realanswer34"
stri_extract_all_regex(dummy,'\\d+?\\d(?!pt)')[[1]]
#[1] "34"
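or using str_extract_all (stringr 1.0 and later use ICU regular expressions, which support lookahead directly, so the perl() wrapper from older versions is no longer needed):
str_extract_all(dummy, '\\d+?\\d(?!pt)')[[1]]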
#[1] "34"
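If you do want the intermediate string from the question (digits stripped in front of every "pt"), here is a minimal base R sketch. It assumes the font sizes are plain runs of digits immediately followed by "pt"; the variable `answer` is just a name chosen for this example.

```r
dummy <- "Blah Blah 10pt margins blah blah 11pt blah format 23pt real answer34"

# \\d+ matches a run of digits; (?=pt) requires "pt" to follow without consuming it.
# perl = TRUE is needed because lookahead is not available in the default regex flavour.
newDummy <- gsub("\\d+(?=pt)", "", dummy, perl = TRUE)
newDummy
# [1] "Blah Blah pt margins blah blah pt blah format pt real answer34"

# Any digits that remain belong to the answer.
answer <- regmatches(newDummy, gregexpr("[0-9]+", newDummy))[[1]]
answer
# [1] "34"
```

Using `\\d+` rather than exactly two characters means sizes such as "8pt" or "100pt" are handled the same way.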
Upvotes: 2