Tyler Rinker

Reputation: 109874

Eliminate space before period unless followed by a digit

How can I use R's regex to eliminate space(s) before period(s) unless period is followed by a digit?

Here's what I have and what I've tried:

x <- c("I have .32 dollars AKA 32 cents . ", 
    "I have .32 dollars AKA 32 cents .  Hello World .")

gsub("(\\s+)(?=\\.+)", "", x, perl=TRUE)
gsub("(\\s+)(?=\\.+)(?<=[^\\d])", "", x, perl=TRUE)

Both attempts give the following, which wrongly removes the space before .32:

## [1] "I have.32 dollars AKA 32 cents. "             
## [2] "I have.32 dollars AKA 32 cents.  Hello World."

I'd like to get:

## [1] "I have .32 dollars AKA 32 cents. "             
## [2] "I have .32 dollars AKA 32 cents.  Hello World."

I'm limited to gsub here, but other solutions are welcome so the question is more useful to future searchers.

Upvotes: 4

Views: 520

Answers (4)

hwnd

Reputation: 70732

You don't need a complex expression; a positive lookahead is enough here.

> gsub(' +(?=\\.(?:\\D|$))', '', x, perl=TRUE)
## [1] "I have .32 dollars AKA 32 cents. "             
## [2] "I have .32 dollars AKA 32 cents.  Hello World."

Explanation:

 +        # ' ' (1 or more times)
(?=       # look ahead to see if there is:
  \.      #   '.'
  (?:     #   group, but do not capture:
    \D    #      non-digits (all but 0-9)
   |      #     OR
    $     #      before an optional \n, and the end of the string
  )       #   end of grouping
)         # end of look-ahead

Note: If these space characters could be any type of whitespace, just replace ' +' with '\\s+' in the pattern.
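To illustrate the whitespace note above, here is a minimal sketch of the \\s+ variant. The input vector y and the helper name clean_ws are made up for this example and are not part of the original answer.

```r
# Hypothetical input mixing tabs and spaces before periods
y <- c("Total is\t.5 units\t.", "Done \t .  Next .75")

# Same lookahead as above, but matching any whitespace instead of ' '
clean_ws <- function(s) gsub("\\s+(?=\\.(?:\\D|$))", "", s, perl = TRUE)

clean_ws(y)
## [1] "Total is\t.5 units." "Done.  Next .75"
```

The tab before .5 and the space before .75 survive because the character after the period is a digit, so the lookahead fails there.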


If you prefer the (*SKIP)(*F) backtracking control verbs, this is one way to write it:

> gsub(' \\.\\d(*SKIP)(*F)| +(?=\\.)', '', x, perl=TRUE)
## [1] "I have .32 dollars AKA 32 cents. "             
## [2] "I have .32 dollars AKA 32 cents.  Hello World."
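One caveat, as far as I can tell: the first alternative only protects a single space before a decimal. With two or more spaces before the digits, the second alternative can still match and strip them. A possible adjustment is to use ' +' in the skipped branch as well. The input z below is a made-up example, and the adjusted pattern is my own variation, not something from the answer above.

```r
# Hypothetical input with two spaces before ".5"
z <- "a  .5 b ."

# Pattern as written above: only one space is covered by the skipped branch
one_space <- gsub(" \\.\\d(*SKIP)(*F)| +(?=\\.)", "", z, perl = TRUE)

# Variation: let the skipped branch consume a whole run of spaces
many_space <- gsub(" +\\.\\d(*SKIP)(*F)| +(?=\\.)", "", z, perl = TRUE)

one_space
## [1] "a.5 b."
many_space
## [1] "a  .5 b."
```

The idea behind (*SKIP)(*F) is that the first branch matches what you want to keep and then fails on purpose, so the regex engine resumes scanning after that text instead of retrying inside it.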

Upvotes: 4

Brian Stephens

Reputation: 5271

Well, I don't know R, but I know regular expressions. Hopefully this answer works in R.

gsub("\\s+\\.(?!\\d)", ".", x, perl=TRUE)

It uses a negative lookahead to ensure that the space(s) and period are not followed by a digit; then it replaces the match with just a period.
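A quick way to check this in R is to compare the result against the output requested in the question. This is just a sketch; the names expected and result are introduced here for illustration.

```r
x <- c("I have .32 dollars AKA 32 cents . ",
       "I have .32 dollars AKA 32 cents .  Hello World .")

# Desired output, copied from the question
expected <- c("I have .32 dollars AKA 32 cents. ",
              "I have .32 dollars AKA 32 cents.  Hello World.")

# The match includes the period, so the replacement puts a period back
result <- gsub("\\s+\\.(?!\\d)", ".", x, perl = TRUE)

identical(result, expected)
## [1] TRUE
```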

Upvotes: 3

Andrie

Reputation: 179448

Try this regex:

x <- c("I have .32 dollars AKA 32 cents . ", 
       "I have .32 dollars AKA 32 cents .  Hello World .",
       "I have .32 dollars AKA 32 cents .  Hello World .xyz")

gsub(" *\\.($|\\D)", ".\\1", x)
[1] "I have .32 dollars AKA 32 cents. "                
[2] "I have .32 dollars AKA 32 cents.  Hello World."   
[3] "I have .32 dollars AKA 32 cents.  Hello World.xyz"

What it does:

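  • " *\\." matches zero or more spaces followed by a period.
  • "($|\\D)" then matches and captures either:
    • the end of the line ($),
    • or "not a digit" (\\D)
  • The replacement writes a period followed by the captured group (\\1), so the character after the period is kept.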
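Because the pattern consumes the character after the period, the replacement has to restore it with the backreference. Here is a small sketch of what happens without it. The variable names are made up for this example, and I write the replacement as ".\\1", which should behave the same since a period needs no escaping in the replacement string.

```r
x3 <- "Hello World .xyz"

# Without the backreference, the captured "x" is lost
no_backref <- gsub(" *\\.($|\\D)", ".", x3)

# With \\1, the captured character is written back after the period
with_backref <- gsub(" *\\.($|\\D)", ".\\1", x3)

no_backref
## [1] "Hello World.yz"
with_backref
## [1] "Hello World.xyz"
```

This approach uses R's default regex engine, so perl=TRUE is not required.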

Upvotes: 2

akrun

Reputation: 887291

This seems to work for the example.

  gsub("\\s(?=\\.[0-9])(*SKIP)(*F)|(\\s+)(?=\\.+)(?<=[^\\d])", "", x, perl=TRUE)
  #[1] "I have .32 dollars AKA 32 cents. "             
  #[2] "I have .32 dollars AKA 32 cents.  Hello World."

Upvotes: 2
