Reputation: 77
What the header says, basically. Given a string, I need to extract from it everything that is not a leading number followed by a space. So, given this string
"420 species of grass"
I would like to get
"species of grass"
But, given a string with a number not in the beginning, like so
"The clock says it is 420"
or a string where the number is followed by a space but is not at the beginning, like so
"It is 420 already"
I would like to get back the same string, with the number preserved
"The clock says it is 420"
"It is 420 already"
Matching a leading number followed by a space works as expected:
> library(stringr)
> str_extract_all("420 species of grass", "^\\d+(?=\\s)")
[[1]]
[1] "420"
> str_extract_all("The clock says it is 420", "^\\d+(?=\\s)")
[[1]]
character(0)
> str_extract_all("It is 420 already", "^\\d+(?=\\s)")
[[1]]
character(0)
But when I try to match anything but a leading number followed by a space, it doesn't work as expected:
> str_extract_all("420 species of grass", "[^(^\\d+(?=\\s))]+")
[[1]]
[1] "species" "of" "grass"
> str_extract_all("The clock says it is 420", "[^(^\\d+(?=\\s))]+")
[[1]]
[1] "The" "clock" "says" "it" "is"
> str_extract_all("It is 420 already", "[^(^\\d+(?=\\s))]+")
[[1]]
[1] "It" "is" "already"
It seems this regex matches anything but digits AND spaces instead.
How do I fix this?
Upvotes: 0
Views: 151
Reputation: 18357
The simplest fix is to replace a run of digits at the very start of the string, together with the whitespace after it, using this regex,
^\d+\s+
with an empty string.
sub("^\\d+\\s+", "", "420 species of grass")
sub("^\\d+\\s+", "", "The clock says it is 420")
sub("^\\d+\\s+", "", "It is 420 already")
Prints,
[1] "species of grass"
[1] "The clock says it is 420"
[1] "It is 420 already"
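Since sub is vectorized, the three calls can also be collapsed into one. A minimal sketch: the name `strings` and the last two inputs are made up here to show the edge cases, they are not from the question.

```r
# `strings` is a hypothetical name; the last two inputs are extra edge cases
strings <- c("420 species of grass",
             "The clock says it is 420",
             "It is 420 already",
             "420species",   # digits not followed by whitespace
             " 420 grass")   # digits not at the very start of the string

# Only elements that start with digits + whitespace are changed
cleaned <- sub("^\\d+\\s+", "", strings)
print(cleaned)
```

Only the first element loses its prefix; every other element comes back unchanged because the pattern is anchored with ^ and requires at least one whitespace character after the digits.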
Alternatively, to do the same by matching rather than replacing, use the following regex and take the contents of group 1,
^(?:\d+\s+)?(.*)$
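Also, most regex syntax loses its special meaning inside a character class. In [^(^\\d+(?=\\s))]+ the parentheses, the inner ^, the +, the ? and the = are plain literal characters, so the lookahead is never applied.
Only \\d and \\s keep their meaning, which is why the class matches any run of characters that are not digits, whitespace or one of those symbols.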
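One way to check this is to compare the pattern from the question with a class that lists the same members explicitly. This is only a sketch in base R with perl = TRUE (stringr uses a different engine, though the class should behave the same way there); the second test string is invented for illustration.

```r
# Second input is a made-up string containing the literal symbols from the class
x <- c("420 species of grass", "a+b=c (d)?")

# The pattern from the question
original <- regmatches(x, gregexpr("[^(^\\d+(?=\\s))]+", x, perl = TRUE))

# The same negated class, with its members written out plainly
explicit <- regmatches(x, gregexpr("[^\\d\\s()^+?=]+", x, perl = TRUE))

print(identical(original, explicit))
print(original[[2]])
```

Both patterns split the strings at digits, whitespace and the symbols ( ) ^ + ? =, so nothing in the class behaves like a lookahead or an anchor.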
Edit:
The solution using sub is simpler, but if you want a match-based solution in R, use str_match instead of str_extract_all.
str_match returns a character matrix whose first column is the whole match and whose second column is group 1, so index the result with [,2].
library(stringr)
print(str_match("420 species of grass", "^(?:\\d+\\s+)?(.*)$")[,2])
print(str_match("The clock says it is 420", "^(?:\\d+\\s+)?(.*)$")[,2])
print(str_match("It is 420 already", "^(?:\\d+\\s+)?(.*)$")[,2])
Prints,
[1] "species of grass"
[1] "The clock says it is 420"
[1] "It is 420 already"
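If stringr is not available, the same capture-group idea can be sketched in base R with regexec and regmatches. The names `strings`, `pattern`, `parts` and `group1` are chosen here for illustration, and perl = TRUE is assumed so that the non-capturing group (?:...) is accepted.

```r
strings <- c("420 species of grass",
             "The clock says it is 420",
             "It is 420 already")
pattern <- "^(?:\\d+\\s+)?(.*)$"

# Each list element holds c(whole match, group 1)
parts <- regmatches(strings, regexec(pattern, strings, perl = TRUE))

# Take the second entry (group 1) of every element
group1 <- vapply(parts, function(p) p[2], character(1), USE.NAMES = FALSE)
print(group1)
```

Because the optional prefix and (.*) can both match nothing, every input matches, so each element of `parts` has exactly two entries.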
Upvotes: 1
Reputation: 677
I think @Douglas's answer is more concise. However, if your actual case is more complicated, you may want to check ?regexpr, which returns the starting position of a pattern's match (or -1 if there is no match).
A method using a for loop is below:
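strings <- list("420 species of grass",
                "The clock says it is 420",
                "It is 420 already")
extract <- function(x) {
  y <- vector("list", length(x))
  for (i in seq_along(x)) {
    # regexpr() gives 1 if the string starts with digits plus a space, -1 otherwise
    m <- regexpr("^\\d+\\s", x[[i]], perl = TRUE)
    if (m[[1]] == 1) {
      # drop the matched prefix; match.length is its length in characters
      y[[i]] <- substr(x[[i]], attr(m, "match.length") + 1, nchar(x[[i]]))
    } else {
      # no leading number: keep the string as it is
      y[[i]] <- x[[i]]
    }
  }
  y
}
> extract(strings)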
[[1]]
[1] "species of grass"
[[2]]
[1] "The clock says it is 420"
[[3]]
[1] "It is 420 already"
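regexpr is itself vectorized, so the same position-based idea can be written without a loop. This is a sketch rather than a drop-in replacement: it assumes the inputs are held in a character vector instead of a list, and the names `strings`, `start`, `len` and `result` are made up here.

```r
strings <- c("420 species of grass",
             "The clock says it is 420",
             "It is 420 already")

# Match a leading run of digits followed by one whitespace character
m <- regexpr("^\\d+\\s", strings, perl = TRUE)
start <- as.integer(m)            # 1 where the prefix was found, -1 otherwise
len <- attr(m, "match.length")    # length of the matched prefix, -1 if none

# Cut the prefix off where it was found, otherwise keep the original string
result <- ifelse(start == 1L, substring(strings, len + 1L), strings)
print(result)
```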
Upvotes: 2
Reputation: 1021
I think the easiest way to do this is by removing the numbers instead of extracting the desired pattern:
library(stringr)
strings <- c("420 species of grass", "The clock says it is 420", "It is 420 already")
str_remove(strings, pattern = "^\\d+\\s")
[1] "species of grass" "The clock says it is 420" "It is 420 already"
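One difference from the sub answer is worth noting: the pattern here ends in \\s rather than \\s+, so it removes exactly one whitespace character after the number. A small sketch with base sub and a made-up input containing two spaces shows the effect:

```r
# Hypothetical input with two spaces after the number
x <- "420  species of grass"

one  <- sub("^\\d+\\s",  "", x)   # removes the digits and a single space
many <- sub("^\\d+\\s+", "", x)   # removes the digits and all following whitespace
print(c(one, many))
```

With \\s the result keeps one leading space, so \\s+ is the safer choice if the spacing in the data is irregular.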
Upvotes: 1