limitingfactor

Reputation: 77

How to match everything except for digits followed by a space and ONLY digits followed by a space?

The problem

What the header says, basically. Given a string, I need to extract from it everything that is not a leading number followed by a space. So, given this string

"420 species of grass"

I would like to get

"species of grass"

But, given a string with a number not in the beginning, like so

"The clock says it is 420"

or a string with a number in the middle rather than at the start, like so

"It is 420 already"

I would like to get back the same string, with the number preserved:

"The clock says it is 420"
"It is 420 already"

What I have tried

Matching a leading number followed by a space works as expected:

> library(stringr)
> str_extract_all("420 species of grass", "^\\d+(?=\\s)")
[[1]]
[1] "420"
> str_extract_all("The clock says it is 420", "^\\d+(?=\\s)")
[[1]]
character(0)
> str_extract_all("It is 420 already", "^\\d+(?=\\s)")
[[1]]
character(0)

But when I try to match anything except a leading number followed by a space, it doesn't work:

> str_extract_all("420 species of grass", "[^(^\\d+(?=\\s))]+")
[[1]]
[1] "species" "of"      "grass"  
> str_extract_all("The clock says it is 420", "[^(^\\d+(?=\\s))]+")
[[1]]
[1] "The"   "clock" "says"  "it"    "is" 
> str_extract_all("It is 420 already", "[^(^\\d+(?=\\s))]+")
[[1]]
[1] "It"      "is"      "already"

It seems this regex matches anything but digits AND spaces instead.

How do I fix this?

Upvotes: 0

Views: 151

Answers (3)

Pushpesh Kumar Rajwanshi

Reputation: 18357

An easy way out is to replace any run of digits followed by whitespace at the very start of the string, matched by this regex,

^\d+\s+

with an empty string.

Sample R code using sub:

sub("^\\d+\\s+", "", "420 species of grass")
sub("^\\d+\\s+", "", "The clock says it is 420")
sub("^\\d+\\s+", "", "It is 420 already")

Prints,

[1] "species of grass"
[1] "The clock says it is 420"
[1] "It is 420 already"

Alternatively, to get the same result by matching, you can use the following regex and capture the contents of group 1,

^(?:\d+\s+)?(.*)$


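Also, regex syntax placed inside a character set loses its special meaning. In [^(^\\d+(?=\\s))]+ the parentheses, ^, +, ? and = are just literal members of the set, and the lookahead is not a lookahead at all. Only \\d and \\s keep their meaning (digits and whitespace), which is why your regex matches runs of characters that are neither digits nor spaces.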
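A minimal sketch of that point, using base R with perl = TRUE instead of stringr (an assumption, so the snippet needs no packages): the bracketed pattern from the question should behave exactly like a plain negated set listing the same characters. The sample string x and the names original and plain_set are made up for illustration.

```r
# Hypothetical sample string containing the characters in question
x <- "a+b=c (420) d?"

# Pattern from the question: everything between [^ and the first ] is a set member
original <- "[^(^\\d+(?=\\s))]+"
# The same set written out plainly: not ( ) ^ + ? =, not a digit, not whitespace
plain_set <- "[^()^+?=\\d\\s]+"

a <- regmatches(x, gregexpr(original, x, perl = TRUE))[[1]]
b <- regmatches(x, gregexpr(plain_set, x, perl = TRUE))[[1]]

print(a)
print(identical(a, b))
```

If the two results are identical, the lookahead syntax inside the brackets contributed nothing beyond its individual characters.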

Edit:

Although the solution using sub is better, in case you want a match-based solution in R, you need to use str_match instead of str_extract_all, and to access the contents of group 1 you need to use [,2]

R code using str_match:

library(stringr)

print(str_match("420 species of grass", "^(?:\\d+\\s+)?(.*)$")[,2])
print(str_match("The clock says it is 420", "^(?:\\d+\\s+)?(.*)$")[,2])
print(str_match("It is 420 already", "^(?:\\d+\\s+)?(.*)$")[,2])

Prints,

[1] "species of grass"
[1] "The clock says it is 420"
[1] "It is 420 already"
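For anyone who prefers to avoid stringr, here is a rough base R sketch of the same capture-group idea. It assumes R 3.3 or later, where regexec accepts perl = TRUE; the names strings, m and rest are just illustrative.

```r
strings <- c("420 species of grass",
             "The clock says it is 420",
             "It is 420 already")

# regexec() records the full match and each capture group for every string
m <- regmatches(strings, regexec("^(?:\\d+\\s+)?(.*)$", strings, perl = TRUE))

# In each list element, position 1 is the full match and position 2 is group 1
rest <- vapply(m, function(parts) parts[2], character(1))
print(rest)
```

Position 2 here plays the same role as the [,2] column in the str_match result.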

Upvotes: 1

Chuan

Reputation: 677

I think @Douglas's answer is more concise. However, I guess your actual case is more complicated, so you may want to check ?regexpr, which can identify the starting position of your specific pattern.

A method using a for loop is below:


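strings <- list("420 species of grass",
                "The clock says it is 420",
                "It is 420 already")

extract <- function(x) {
  y <- vector("list", length(x))
  for (i in seq_along(x)) {
    # Position of a leading "digits + whitespace" prefix (-1 if there is none)
    m <- regexpr("^\\d+\\s", x[[i]])
    if (m[[1]] == -1) {
      # No leading number: keep the string unchanged
      y[[i]] <- x[[i]]
    } else {
      # Drop the matched prefix and keep the rest of the string
      y[[i]] <- substr(x[[i]], attr(m, "match.length") + 1, nchar(x[[i]]))
    }
  }
  return(y)
}

> extract(strings)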
[[1]]
[1] "species of grass"

[[2]]
[1] "The clock says it is 420"

[[3]]
[1] "It is 420 already"
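Since regexpr is vectorised, the loop is not strictly needed. A possible loop-free sketch along the same lines is below; it assumes a leading run of digits followed by one whitespace character is the prefix to drop, and the names x, m, skip and out are mine.

```r
x <- c("420 species of grass",
       "The clock says it is 420",
       "It is 420 already")

# One match attempt per string; match.length is -1 where nothing matched
m <- regexpr("^\\d+\\s", x, perl = TRUE)

# Number of leading characters to skip: the match length, or 0 if no match
skip <- pmax(attr(m, "match.length"), 0)

# Keep everything after the skipped prefix
out <- substring(x, skip + 1)
print(out)
```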

Upvotes: 2

Douglas Mesquita

Reputation: 1021

I think the easiest way to do this is by removing the numbers instead of extracting the desired pattern:

library(stringr)

strings <- c("420 species of grass", "The clock says it is 420", "It is 420 already")
str_remove(strings, pattern = "^\\d+\\s")

[1] "species of grass"         "The clock says it is 420" "It is 420 already"
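One detail worth checking against your real data: \\s removes exactly one whitespace character, while \\s+ (as in the other answer) removes the whole run. A small base R sketch of the difference, with a made-up input containing two spaces after the number:

```r
# Hypothetical input with two spaces after the leading number
x <- "420  species of grass"

# Removes the digits and a single whitespace character
one_space <- sub("^\\d+\\s", "", x, perl = TRUE)

# Removes the digits and all whitespace that follows them
all_spaces <- sub("^\\d+\\s+", "", x, perl = TRUE)

print(one_space)
print(all_spaces)
```

The same choice applies to the pattern passed to str_remove.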

Upvotes: 1
