R, selectively removing parts of a string

Question

I'm having trouble thinking of an efficient way to remove parts from a string using R. I have text data that I'm reading into R. The data is in HTML, which kind of looks like this:

dummy <- c("Blah Blah 10pt margins blah blah 11pt blah format 23pt real answer34")

I'm trying to isolate only that "34", but I can't simply pull out all the numbers because of the "10pt", "11pt" and "23pt" HTML formatting.

What I would like to do is, for every instance of the text "pt", remove the two characters in front of it. If I do that, I get:

newDummy <- c("Blah Blah pt margins blah blah pt blah format pt real answer34")

Then I can get my answer of 34 via str_extract_all(newDummy,"$?[0-9,.]+$?") from the stringr library.

The problem is that I can't seem to efficiently turn "dummy" into "newDummy" -- does anyone have a neat solution?

Thanks!

akrun · Accepted Answer

You could either use:

dummy <- c("Blah Blah 10pt margins blah blah 11pt blah format 23pt real answer34")
library(stringi)
stri_extract_all_regex(dummy, '\\d+?\\d(?!pt)')[[1]]
#[1] "34"

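or, if the number stands on its own as a separate word (for example "answer 34" instead of "answer34"):

library(stringr)
# \\b only matches at a word boundary, and there is none between "r" and "3" in "answer34"
str_extract_all("format 23pt real answer 34", "\\b\\d+\\b")[[1]]
#[1] "34"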

Update

dummy <- "10pt 11pt realanswer34"
stri_extract_all_regex(dummy, '\\d+?\\d(?!pt)')[[1]]
#[1] "34"

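or using str_extract_all (stringr 1.0 and later use the ICU regex engine by default, which supports the lookahead, so the perl() wrapper from older versions is not needed)

str_extract_all(dummy, '\\d+?\\d(?!pt)')[[1]]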
#[1] "34"
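
The transformation described in the question (dropping the two characters in front of each "pt") can also be done directly. Below is a minimal base R sketch, assuming those two characters are always digits. The names dummy and newDummy follow the question; nums is just an illustrative name.

```r
dummy <- "Blah Blah 10pt margins blah blah 11pt blah format 23pt real answer34"

# Remove exactly two digits that sit directly in front of "pt".
# The lookahead (?=pt) keeps "pt" itself in place; base R needs perl = TRUE for it.
newDummy <- gsub("\\d{2}(?=pt)", "", dummy, perl = TRUE)
newDummy
#[1] "Blah Blah pt margins blah blah pt blah format pt real answer34"

# With the "NNpt" sizes gone, any remaining run of digits is the answer.
nums <- regmatches(newDummy, gregexpr("[0-9]+", newDummy))[[1]]
nums
#[1] "34"
```

If the font sizes can have one or three digits, "\\d+(?=pt)" would probably be the safer pattern.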
