Reputation: 109844
I'm trying to write a single regex that removes everything except letters, apostrophes, and single spaces.
I tried [^\\p{L} ']+ together with a lookbehind for the extra spaces, (?<=\\s)\\s+. Each works in isolation:
gsub("(?<=\\s)\\s+", "", "I like 56 dogs that's him55.", perl = TRUE)
## [1] "I like 56 dogs that's him55."
gsub("[^\\p{L} ']+", "", "I like 56 dogs that's him55.", perl = TRUE)
## [1] "I like  dogs that's him"
But when I use alternation (|) to connect them:
gsub("((?<=\\s)\\s+)|([^\\p{L} ']+)", "", "I like 56 dogs that's him55.", perl = TRUE)
This returns:
[1] "I like  dogs that's him"
I'd like it to also remove the extra space (between like and dogs), giving:
[1] "I like dogs that's him"
How can I use one regex to remove everything except letters, apostrophes, and single spaces (that is, also collapsing any extra spaces)?
Upvotes: 1
Views: 570
Reputation: 70722
If you want to do this in one call, you can try the following, where x holds the input string:
gsub("[^\\pL' ]+\\h+(?=\\h)|\\h+(?=[^\\pL' ]+)|[^\\pL' ]+", "", x, perl = TRUE)
# [1] "I like dogs that's him"
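As a rough sketch of why this works where the lookbehind version does not: lookarounds are evaluated against the original string, not against the partially cleaned result, so the space after "56" still has "6" in front of it and the lookbehind never fires. The lookahead version removes the space in front of the junk instead. The variable name pattern and the use of paste() to split the regex into commented pieces are only for illustration.

```r
x <- "I like 56 dogs that's him55."

# The question's pattern: the space after "56" is preceded by "6" in the
# original string, so (?<=\\s) fails there and both spaces survive.
gsub("((?<=\\s)\\s+)|([^\\p{L} ']+)", "", x, perl = TRUE)

# The one-call pattern, split into its three alternatives.
pattern <- paste(
  "[^\\pL' ]+\\h+(?=\\h)",  # junk plus spaces, when another space follows
  "\\h+(?=[^\\pL' ]+)",     # spaces that sit directly in front of junk
  "[^\\pL' ]+",             # any junk that is left over
  sep = "|"
)
gsub(pattern, "", x, perl = TRUE)
```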
Here is another approach, which is more efficient IMO:
x <- "I like 56 dogs that's him55."
r <- gsub("[^\\pL' ]+", '', x, perl = TRUE)
paste(strsplit(r, '\\s+')[[1]], collapse = ' ')
# [1] "I like dogs that's him"
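If this needs to run over a whole character vector, the same two steps could be wrapped up roughly like this. Note that this is a variation, not the code above: it swaps strsplit()/paste() for a second gsub() plus trimws() so that it stays vectorized, and the name clean_text is made up for the example.

```r
# Hypothetical helper; clean_text is an illustrative name.
clean_text <- function(x) {
  # Step 1: drop everything that is not a letter, apostrophe, or space.
  kept <- gsub("[^\\pL' ]+", "", x, perl = TRUE)
  # Step 2: collapse runs of spaces, then strip leading/trailing whitespace.
  trimws(gsub(" +", " ", kept))
}

clean_text(c("I like 56 dogs that's him55.", "  42 cats!  "))
```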
Upvotes: 2
Reputation: 686
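It seems the issue comes from having a space in your character class: the spaces on either side of a removed number are both kept. Leaving the space out of the class and replacing each match with a single space worked fine for me: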
gsub("[^\\p{L}']+", " ", "I like 56 dogs that's him55.", perl = TRUE)
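One thing to watch with this approach, shown as a small sketch: junk at the very end of the string (here "55.") is also turned into a space, so the result ends with a trailing space. Wrapping the call in trimws() is one way to handle that; the variable names below are only for illustration.

```r
x <- "I like 56 dogs that's him55."

# Every run of non-letter, non-apostrophe characters becomes one space,
# including the trailing "55.".
raw <- gsub("[^\\p{L}']+", " ", x, perl = TRUE)

# trimws() strips the leading/trailing whitespace that this can leave behind.
cleaned <- trimws(raw)
cleaned
```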
Upvotes: 2