Martin Feldkircher

Reputation: 21

I want to write a regex in R that removes every word containing numbers from a string.

For example:

x<-"Saint  A/74/PV.46 12/12/2019 4/66 19-40538 Lucia"

Should give me "Saint Lucia".

I tried

trimws(gsub("\\w*[0-9]+\\w*\\s*", "", x))

which gave me

Saint  A//PV.///-Lucia

Any help would be very much appreciated.

Upvotes: 2

Views: 260

Answers (3)

B. Christian Kamgang

Reputation: 6529

For a string shaped like the example, you could use gsub to replace everything from the first space (" ") to the last space with a single space.

x <- "Saint  A/74/PV.46 12/12/2019 4/66 19-40538 Lucia"
gsub(" .+ ", " ", x)
[1] "Saint Lucia"
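
A quick check on a different input shows what this pattern assumes: it drops everything between the first and the last space, whether or not it contains digits. The string y below is made up for illustration.

```r
# Hypothetical input with letter-only words in the middle
y <- "Saint Vincent 12/12/2019 and 4/66 Grenadines"

# " .+ " is greedy: it runs from the first space to the last one
res <- gsub(" .+ ", " ", y)
print(res)
# [1] "Saint Grenadines"
```

So it fits strings shaped like the one in the question, where only the first and last words should survive.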

Upvotes: 0

akrun

Reputation: 887971

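We could use gsub to match runs of uppercase letters, digits, slashes, dots and hyphens from one word boundary (\\b) to the next and replace them with an empty string (""), then collapse the leftover runs of whitespace into a single space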

gsub("\\s{2,}", " ", gsub("\\b[A-Z/0-9.-]+\\b", "", x))
#[1] "Saint Lucia"

Or using str_extract_all from stringr to keep only the letter-only words

library(stringr)
str_c(str_extract_all(x, "(?<= |^)[[:alpha:]]+(?= |$)")[[1]], collapse = " ")
#[1] "Saint Lucia"
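
One thing to watch, sketched on a made-up string y: because the character class includes A-Z, the first pattern also removes words written entirely in capitals, and a removed leading word leaves a space behind that trimws can strip.

```r
# Hypothetical input starting with an all-caps word
y <- "UN 12/12/2019 Saint Lucia"

# Same two gsub calls as above, wrapped in trimws to drop the leading space
res <- trimws(gsub("\\s{2,}", " ", gsub("\\b[A-Z/0-9.-]+\\b", "", y)))
print(res)
# [1] "Saint Lucia"
```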

Upvotes: 2

Wiktor Stribiżew

Reputation: 627536

You can use a replacing approach:

x<-"Saint  A/74/PV.46 12/12/2019 4/66 19-40538 Lucia"
gsub("\\s*(?<!\\S)(?!\\p{L}+(?!\\S))\\S+", "", x, perl=TRUE)
## => [1] "Saint Lucia"
library(stringr)
str_replace_all(x, "\\s*(?<!\\S)(?!\\p{L}+(?!\\S))\\S+", "")
## => [1] "Saint Lucia"

Details:

  • \s* - zero or more whitespaces
  • (?<!\S) - start of string or a position immediately preceded by a whitespace
  • (?!\p{L}+(?!\S)) - the next non-whitespace chunk cannot be a letter-only word
  • \S+ - one or more non-whitespace chars.
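
As a minimal sketch, the replacement can be wrapped in a small helper and applied to a whole character vector. The function name remove_non_letter_words and the second test string are made up for illustration; trimws is added as an assumption to handle the space left over when the first word is removed.

```r
# Hypothetical helper: drop every whitespace-delimited chunk that is not
# made of letters only, then trim any leading/trailing whitespace
remove_non_letter_words <- function(s) {
  trimws(gsub("\\s*(?<!\\S)(?!\\p{L}+(?!\\S))\\S+", "", s, perl = TRUE))
}

inputs <- c("Saint  A/74/PV.46 12/12/2019 4/66 19-40538 Lucia",
            "19-40538 Saint Lucia")
res <- remove_non_letter_words(inputs)
print(res)
# [1] "Saint Lucia" "Saint Lucia"
```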

Or, you may match all letter only words in between whitespace boundaries and join the matches with a space:

paste(unlist(regmatches(x, gregexpr("(?<!\\S)\\p{L}+(?!\\S)", x, perl=TRUE))), collapse=" ")
## => [1] "Saint Lucia"

The pattern matches:

  • (?<!\S) - a position at the start of string or right after a whitespace
  • \p{L}+ - one or more Unicode letters
  • (?!\S) - immediately on the right, there must be a whitespace or end of string.
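
The extraction approach can be sketched for a vector of strings as well. Note that unlist would merge the matches of all elements, so this version pastes each element's matches separately; the name keep_letter_words and the second input are hypothetical.

```r
# Hypothetical helper: keep only letter-only words, one result per input string
keep_letter_words <- function(s) {
  m <- regmatches(s, gregexpr("(?<!\\S)\\p{L}+(?!\\S)", s, perl = TRUE))
  # paste() each element's matches on its own; no matches gives ""
  vapply(m, paste, character(1), collapse = " ")
}

inputs <- c("Saint  A/74/PV.46 12/12/2019 4/66 19-40538 Lucia",
            "4/66 only 12")
res <- keep_letter_words(inputs)
print(res)
# [1] "Saint Lucia" "only"
```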

Upvotes: 1
