Xiangyu

Reputation: 45

R: Remove dots in text but not those marking decimal points

I am a newcomer to regular expressions, so please bear with me.

I have a string like this:

txt1 <- 'a,b,a.b,a.,1,2,1.2,1.,.,11,222,11.222,11.'

Imagine it comes from a .csv file in which cells are separated by ','. I would like to remove every '.' except those marking decimal points, so that I end up with something like this:

txt2 <- 'a,b,ab,a,1,2,1.2,1,,11,222,11.222,11'

I have tried the following code:

txt2 <- gsub(pattern = '[^a-z0-9,(\\d\\.\\d)]', replacement = '', text = txt1)
txt2 <- gsub(pattern = '[^a-z0-9,|(\\d\\.\\d)]', replacement = '', text = txt1)

But neither works; both return the string unchanged:

> print(txt2)
[1] "a,b,a.b,a.,1,2,1.2,1.,.,11,222,11.222,11."

Any idea how I might correct my code? Thanks!

Upvotes: 3

Views: 946

Answers (3)

zambonee

Reputation: 1647

The key is to use a negative lookbehind, (?<!...), and a negative lookahead, (?!...):

> txt1 <- 'a,b,a.b,a.,1,2,1.2,1.,.,11,222,11.222,11.'
> txt2 <- gsub(pattern='((?<![0-9])\\.)|(\\.(?![0-9]))', replacement='', x=txt1, perl=TRUE)
> txt2
[1] "a,b,ab,a,1,2,1.2,1,,11,222,11.222,11"

This pattern matches a period \\. that is either not preceded by a digit 0-9 or not followed by a digit 0-9, which includes a period at the very start or end of the string. You have to set perl=TRUE for R to recognize the lookbehind and lookahead.

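Note that this also trims a leading period, so '.2' becomes '2'. If this is not wanted, change the lookbehind to (?<![0-9,]), which keeps a period that directly follows a comma and precedes a digit. A '.2' at the very start of the string is still trimmed, because no character precedes it.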
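
To make the difference between the two lookbehinds concrete, here is a minimal sketch. The input string x is made up for illustration and is not from the question; it assumes cells may start with a bare decimal point.

```r
# Illustrative input (an assumption, not from the question):
# two cells start with a decimal point, one of them at the very start.
x <- '.2,a,.5,1.'

# Pattern from this answer: a period is kept only if it has a digit on both sides.
strict <- gsub('((?<![0-9])\\.)|(\\.(?![0-9]))', '', x, perl = TRUE)

# Variant with a comma added to the lookbehind class: a period that follows
# a comma and precedes a digit is now kept as well.
lenient <- gsub('((?<![0-9,])\\.)|(\\.(?![0-9]))', '', x, perl = TRUE)

print(strict)   # [1] "2,a,5,1"
print(lenient)  # [1] "2,a,.5,1"
```

The leading '.2' is removed in both cases because nothing precedes it, so the negative lookbehind succeeds there.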

Upvotes: 4

Maurits Evers

Reputation: 50668

Negative lookahead (as suggested by @CAustin) seems to be the most elegant and concise.

Since none of the above solutions give you the actual R code, here it is:

txt2 <- gsub("\\.(?!\\d)", "", txt1, perl = TRUE)
txt2
[1] "a,b,ab,a,1,2,1.2,1,,11,222,11.222,11"
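
The lookahead-only pattern and the two-sided lookbehind/lookahead pattern from the other answer agree on the question's input, but they can differ elsewhere. The following is a rough sketch on a made-up string y (an assumption for illustration) showing where they diverge:

```r
# Illustrative input (not from the question): a period between a letter and
# a digit, and a cell that starts with a decimal point.
y <- 'a.1,.5,1.,x.'

# Lookahead only: a period is kept whenever a digit follows it.
lookahead_only <- gsub('\\.(?!\\d)', '', y, perl = TRUE)

# Lookbehind or lookahead: a period is kept only with a digit on both sides.
both_sides <- gsub('((?<![0-9])\\.)|(\\.(?![0-9]))', '', y, perl = TRUE)

print(lookahead_only)  # [1] "a.1,.5,1,x"
print(both_sides)      # [1] "a1,5,1,x"
```

Which behaviour is preferable depends on whether values such as '.5' or 'a.1' can occur in the data and should keep their period.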

Upvotes: 0

CAustin

Reputation: 4614

You can use negative lookahead. Match \.(?!\d) and replace it with nothing.

https://regex101.com/r/LNHYOY/1
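
The pattern above is written as a bare regex, with single backslashes. In an R string literal each backslash has to be doubled, or, assuming R 4.0.0 or later, the pattern can be written as a raw string. A small sketch using the question's txt1 (the variable names escaped and raw_str are only for illustration):

```r
txt1 <- 'a,b,a.b,a.,1,2,1.2,1.,.,11,222,11.222,11.'

# Ordinary string literal: every backslash in the regex is doubled.
escaped <- gsub("\\.(?!\\d)", "", txt1, perl = TRUE)

# Raw string literal (assumes R >= 4.0.0): the regex is written as-is
# between r"( and )".
raw_str <- gsub(r"(\.(?!\d))", "", txt1, perl = TRUE)

print(escaped)                    # [1] "a,b,ab,a,1,2,1.2,1,,11,222,11.222,11"
print(identical(escaped, raw_str))  # [1] TRUE
```

In both forms perl = TRUE is still required, since the lookahead is not supported by R's default regex engine.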

Upvotes: 0
