castaa95

Reputation: 83

A Regex to remove digits except for words starting with #

I have some strings that can contain letters, digits and the '#' symbol.

I would like to remove the digits, except in words that start with '#'.

Here is an example:

"table9 dolv5e #10n #dec10 #nov8e 23 hello"

And the expected output is:

"table dolve #10n #dec10 #nov8e  hello"

How can I do this with regex, stringr or gsub?

Upvotes: 8

Views: 1083

Answers (5)

Transamunos

Reputation: 101

INPUT = "table9 dolv5e #10n #dec10 #nov8e 23 hello";
OUTPUT = INPUT.match(/#\w+|[^#\d]+/g).join('');

Use this pattern: #\w+|[^#\d]+

#\w+ = a # followed by one or more word characters, so a word starting with # is kept whole, digits included.
[^#\d]+ = a run of characters that are neither # nor digits, so everything else is kept without its digits.

There is no need to write [\w\d], because \w already includes digits. The i flag is not needed either, since nothing in the pattern depends on letter case.

I tried the code in JavaScript and it works: digits outside # words match neither alternative, so match() skips them and join('') glues the kept pieces back together. If a # word can contain characters other than letters, digits and underscores, you can use #\S+ instead of #\w+.
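
Since the question asks about R, the match-and-join idea can also be sketched there with gregexpr() and regmatches(). This is only a sketch under assumptions: the pattern `#\w+|[^#0-9]+` is chosen for this example (keep whole # words, and keep every other run of non-digit characters), and the variable names are illustrative.

```r
x <- "table9 dolv5e #10n #dec10 #nov8e 23 hello"

# Collect every piece that should survive: a whole "#word", or a run of
# characters that are neither "#" nor digits. Digits outside "#" words
# match neither alternative and are therefore left out.
pieces <- regmatches(x, gregexpr("#\\w+|[^#0-9]+", x))[[1]]

# Glue the surviving pieces back together without a separator.
result <- paste(pieces, collapse = "")
print(result)
```

Because the removed "23" sat between two spaces, both spaces survive in the result, which matches the expected output in the question.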

Upvotes: 0

bobble bubble

Reputation: 18565

How about capturing what you want to keep and replacing every match with the capture, which is empty for the unwanted digits?

gsub("(#\\S+)|\\d+","\\1",x)

See demo at regex101 or R demo at tio.run (I have no experience with R)

This answer assumes that there is always whitespace between the words, as in #foo bar #baz2. If you have something like #foo1,bar2:#baz3 4, use \w (word character) instead of \S (non-whitespace).
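
To make the difference between \S and \w concrete, here is a small comparison in R. It is a sketch only: the second input is the punctuation example from the paragraph above, and the variable names are made up for illustration.

```r
x1 <- "table9 dolv5e #10n #dec10 #nov8e 23 hello"
x2 <- "#foo1,bar2:#baz3 4"

# \S variant: a "#" word runs until the next whitespace, so in x2 the
# whole "#foo1,bar2:#baz3" is protected, including the 2 in "bar2".
with_S <- gsub("(#\\S+)|\\d+", "\\1", c(x1, x2))

# \w variant: a "#" word stops at the first non-word character, so in x2
# only "#foo1" and "#baz3" are protected and the 2 in "bar2" is removed.
with_w <- gsub("(#\\w+)|\\d+", "\\1", c(x1, x2))

print(with_S)
print(with_w)
```

On the whitespace-separated example from the question both variants give the same result; they only differ when punctuation touches a # word.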

Upvotes: 5

hello_friend

Reputation: 5798

Base R solution:

unlisted_strings <- unlist(strsplit(X, "\\s+"))
Y <- paste0(na.omit(ifelse(grepl("[#]", unlisted_strings),
                           unlisted_strings,
                           gsub("\\d+", "", unlisted_strings))), collapse = " ")
Y

Data:

X <- as.character("table9 dolv5e #10n #dec10 #nov8e 23 hello")

Upvotes: 0

user2474226

Reputation: 1502

You could split the string on spaces, remove digits from tokens if they don't start with '#' and paste back:

x <- "table9 dolv5e #10n #dec10 #nov8e 23 hello"
y <- unlist(strsplit(x, ' '))
paste(ifelse(startsWith(y, '#'), y, gsub('\\d+', '', y)), collapse = ' ')
# output 
[1] "table dolve #10n #dec10 #nov8e  hello"
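
One detail worth checking when adapting this approach: sub() replaces only the first match in each element, whereas gsub() replaces every match. The example string has at most one run of digits per word, so the difference does not show there. A minimal sketch with made-up tokens:

```r
# Hypothetical tokens, not from the question: the first one contains
# several separate runs of digits.
tokens <- c("a1b2c3", "x9")

first_only <- sub("\\d+", "", tokens)   # removes only the first run per token
all_runs   <- gsub("\\d+", "", tokens)  # removes every run

print(first_only)
print(all_runs)
```

If words can contain more than one run of digits, gsub() is the safer choice for the non-# tokens.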

Upvotes: 5

StupidWolf

Reputation: 47008

You can use gsub to remove digits, for example:

gsub("[0-9]","","table9")
"table"

And we can split your string using strsplit:

STRING = "table9 dolv5e #10n #dec10 #nov8e 23 hello"
strsplit(STRING," ")
[[1]]
[1] "table9" "dolv5e" "#10n"   "#dec10" "#nov8e" "23"     "hello"

We then just need to apply gsub only to the elements of the split string that do not contain "#":

STRING = unlist(strsplit(STRING," "))
no_hex = !grepl("#",STRING)
STRING[no_hex] = gsub("[0-9]","",STRING[no_hex])
paste(STRING,collapse=" ")
[1] "table dolve #10n #dec10 #nov8e  hello"
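
If you have several strings to clean, the split, filter and paste steps can be wrapped in a function. The sketch below is one possible way to do it, not the only one: `strip_digits` is a hypothetical helper name, and it uses startsWith() so that only words that begin with "#" are protected, which is slightly stricter than the grepl("#", ...) test above.

```r
# Hypothetical helper: remove digits from every space-separated word
# that does not start with "#". Works on a character vector.
strip_digits <- function(x) {
  vapply(strsplit(x, " ", fixed = TRUE), function(tokens) {
    protect <- startsWith(tokens, "#")
    tokens[!protect] <- gsub("[0-9]", "", tokens[!protect])
    paste(tokens, collapse = " ")
  }, character(1))
}

cleaned <- strip_digits(c("table9 dolv5e #10n #dec10 #nov8e 23 hello",
                          "a1 #b2"))
print(cleaned)
```

A word that consisted only of digits becomes an empty string, so the spaces on either side of it both remain in the result.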

Upvotes: 1
