Tyler Rinker

Reputation: 109844

Regex to remove everything except letters and remove multiple spaces

I'm trying to make a single regex to remove everything except:

  1. letters
  2. apostrophes
  3. single spaces

I tried [^\\p{L} ']+ together with a lookbehind for the extra spaces, (?<=\\s)\\s+. Each works in isolation:

gsub("(?<=\\s)\\s+", "", "I like 56 dogs that's him55.", perl = TRUE)
## [1] "I like 56 dogs that's him55."

gsub("[^\\p{L} ']+", "", "I like 56 dogs that's him55.", perl = TRUE)
## [1] "I like  dogs that's him"

But when I use or (|) to connect them:

gsub("((?<=\\s)\\s+)|([^\\p{L} ']+)", "", "I like 56 dogs that's him55.", perl = TRUE)

This returns:

[1] "I like  dogs that's him"

I'd like it to also remove the extra space (between like & dogs), giving:

[1] "I like dogs that's him"

How can I use one regex to remove everything except letters, apostrophes, and single spaces?

Upvotes: 1

Views: 570

Answers (2)

hwnd

Reputation: 70722

You can try the following if you're trying to do this in one call:

x <- "I like 56 dogs that's him55."
gsub("[^\\pL' ]+\\h+(?=\\h)|\\h+(?=[^\\pL' ]+)|[^\\pL' ]+", "", x, perl = TRUE)
# [1] "I like dogs that's him"
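
To see why the alternation in the question leaves the double space while this pattern does not, it can help to list what each pattern actually matches. The following is a minimal sketch using base R's gregexpr() and regmatches(); the names q_pattern, a_pattern, q_matches and a_matches are introduced here for illustration only.

```r
x <- "I like 56 dogs that's him55."

# Pattern from the question: lookbehind for extra spaces OR unwanted characters.
q_pattern <- "((?<=\\s)\\s+)|([^\\p{L} ']+)"
# Pattern from this answer.
a_pattern <- "[^\\pL' ]+\\h+(?=\\h)|\\h+(?=[^\\pL' ]+)|[^\\pL' ]+"

# List every match each pattern finds in the original string.
q_matches <- regmatches(x, gregexpr(q_pattern, x, perl = TRUE))[[1]]
a_matches <- regmatches(x, gregexpr(a_pattern, x, perl = TRUE))[[1]]

print(q_matches)  # only the digits and the trailing "55."
print(a_matches)  # additionally the single space in front of "56"
```

The lookbehind is evaluated against the original string, where the space after 56 is preceded by a digit rather than by whitespace, so that branch never fires. The pattern here instead consumes the space in front of the digits.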

Alternatively, you could do it in two steps, which I think is more efficient: remove the unwanted characters first, then collapse the whitespace.

x <- "I like 56 dogs that's him55."
r <- gsub("[^\\pL' ]+", '', x, perl = TRUE)
paste(strsplit(r, '\\s+')[[1]], collapse = ' ')
# [1] "I like dogs that's him"
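
Because strsplit(r, '\\s+')[[1]] only looks at the first element, the two-step idea can also be written so that it handles a whole character vector. This is a sketch under that assumption, not part of the original answer; clean_text is a hypothetical helper name, and the second step uses gsub() plus trimws() instead of strsplit()/paste().

```r
# Hypothetical helper: two-step cleanup that works on character vectors.
clean_text <- function(x) {
  # Step 1: drop everything that is not a letter, an apostrophe or a space.
  kept <- gsub("[^\\p{L}' ]+", "", x, perl = TRUE)
  # Step 2: collapse runs of whitespace to one space, then trim both ends.
  trimws(gsub("\\s+", " ", kept))
}

clean_text("I like 56 dogs that's him55.")
clean_text(c("a1 b", "  c  9 d "))
```

Keeping the two steps separate makes each pattern easy to read, at the cost of a second pass over the string.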

Upvotes: 2

maque

Reputation: 686

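The issue seems to come from having a space inside your character class: spaces are never matched, so removing 56 leaves the two spaces around it side by side. Dropping the space from the class and replacing each match with a single space worked fine for me: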

gsub("[^\\p{L}']+", " ", "I like 56 dogs that's him55.", perl = TRUE)
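
One thing to watch for with this approach: the trailing "55." is also replaced by a space, so the result ends in a space. A minimal sketch of one way to handle that, assuming leading and trailing whitespace should be dropped, is to wrap the call in base R's trimws(); the names raw and cleaned are illustrative.

```r
x <- "I like 56 dogs that's him55."

# Replace every run of non-letter, non-apostrophe characters with one space.
raw <- gsub("[^\\p{L}']+", " ", x, perl = TRUE)

# Remove the space left behind where "55." used to be.
cleaned <- trimws(raw)

print(raw)
print(cleaned)
```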

Upvotes: 2
