ZZ123

Reputation: 25

R, selectively removing parts of a string

I'm having trouble thinking of an efficient way to remove parts from a string using R. I have text data that I'm reading into R. The data is in HTML, which kind of looks like this:

dummy <- c("Blah Blah 10pt margins blah blah 11pt blah format 23pt real answer34")

I'm trying to isolate only that "34", but I can't simply pull out every number because of the "10pt", "11pt", and "23pt" HTML formatting.

What I would like to do is, for every instance of the text "pt", remove the two characters in front of it. If I do that, I get:

newDummy <- c("Blah Blah pt margins blah blah pt blah format pt real answer34")

Then I can get my answer of 34 via str_extract_all(newDummy,"\\(?[0-9,.]+\\)?") from the stringr library.

The problem is that I can't seem to efficiently turn "dummy" into "newDummy" -- does anyone have a neat solution?

Thanks!

Upvotes: 0

Views: 180

Answers (1)

akrun

Reputation: 887901

You could either use:

dummy <- c("Blah Blah 10pt margins blah blah 11pt blah format 23pt real answer34")
library(stringi)
stri_extract_all_regex(dummy,'\\d+?\\d(?!pt)')[[1]]
#[1] "34"

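or, when the number stands alone as its own word (for example "real answer 34"; with "answer34" there is no word boundary before the 3, so this pattern returns character(0)):

library(stringr)
str_extract_all("Blah Blah 10pt margins 23pt real answer 34", "\\b\\d+\\b")[[1]]
#[1] "34"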
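
If the intermediate string from the question is wanted as well, here is a minimal base R sketch, assuming the number in front of "pt" is always exactly two digits as in the example. It deletes those two digits with a lookahead, leaving "pt" in place, and then pulls out the remaining digits:

```r
dummy <- "Blah Blah 10pt margins blah blah 11pt blah format 23pt real answer34"

# Drop exactly two digits that are immediately followed by "pt".
# The lookahead (?=pt) checks for "pt" without consuming it, so "pt" stays.
newDummy <- gsub("[0-9]{2}(?=pt)", "", dummy, perl = TRUE)

# Extract whatever runs of digits are left.
answer <- regmatches(newDummy, gregexpr("[0-9]+", newDummy))[[1]]
```

Here newDummy is the string shown in the question and answer holds the remaining number as a character string; wrap it in as.numeric() if a number is needed.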

Update

dummy <- "10pt 11pt realanswer34"
stri_extract_all_regex(dummy,'\\d+?\\d(?!pt)')[[1]]
#[1] "34"

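or using str_extract_all (current stringr versions use the ICU regex engine, which supports lookaheads directly, so the old perl() wrapper is no longer needed)

str_extract_all(dummy, '\\d+?\\d(?!pt)')[[1]]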
#[1] "34"
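
One caveat: the pattern '\\d+?\\d(?!pt)' needs at least two digits, and on an input such as "100pt" it matches the leading "10", because those two digits are followed by "0" rather than "pt". If the sizes are not always two digits, a variant like the sketch below may be safer. It is written in base R, and the helper name extract_non_pt_numbers is made up for illustration:

```r
# Hypothetical helper: return every run of digits that is NOT followed by "pt".
# The lookahead (?![0-9]|pt) rejects a match that stops in the middle of a
# number (next character is a digit) or right before "pt".
extract_non_pt_numbers <- function(x) {
  regmatches(x, gregexpr("[0-9]+(?![0-9]|pt)", x, perl = TRUE))[[1]]
}

res_update <- extract_non_pt_numbers("10pt 11pt realanswer34")
res_mixed  <- extract_non_pt_numbers("100pt 7 answer34")
```

With these inputs, res_update should contain only "34", while res_mixed should contain "7" and "34" and skip the "100" in front of "pt".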

Upvotes: 2
