Alex

Reputation: 19803

Remove spaces between words of a certain length

I have strings of the following variety:

A B C Company
XYZ Inc
S & K Co

I would like to remove only the spaces that fall between one-letter words. For example, in the first string I would like to remove the spaces between A, B, and C, but not the one between C and Company. The result should be:

ABC Company
XYZ Inc
S&K Co

What is the proper regular expression to use in gsub for this?

Upvotes: 14

Views: 2862

Answers (5)

Rich Scriven

Reputation: 99331

Obligatory strsplit / paste answer. This will also get those single characters that might be in the middle or at the end of the string.

x <- c('A B C Company', 'XYZ Inc', 'S & K Co', 
       'A B C D E F G Company', 'Company A B C', 'Co A B C mpany')

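foo <- function(x) {
    # Replace every single-character token with all such tokens pasted together
    x[nchar(x) == 1L] <- paste(x[nchar(x) == 1L], collapse = "")
    # Drop the resulting duplicates, then rejoin the tokens with spaces
    paste(unique(x), collapse = " ")
}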

vapply(strsplit(x, " "), foo, character(1L))
# [1] "ABC Company"     "XYZ Inc"         "S&K Co"         
# [4] "ABCDEFG Company" "Company ABC"     "Co ABC mpany"
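
One caveat with this approach: because it pastes all single-character tokens together and then calls unique(), it turns 'A Company B' into "AB Company" and 'the the A B' into "the AB". If that matters for your data, here is a minimal sketch of a variant that merges only runs of adjacent single-character tokens. The helper name merge_adjacent is hypothetical and not part of the answer above.

```r
# Hypothetical helper: collapse only runs of ADJACENT single-character tokens
merge_adjacent <- function(tokens) {
    # Run-length encode the "is this token one character long?" flags
    runs   <- rle(nchar(tokens) == 1L)
    ends   <- cumsum(runs$lengths)
    starts <- ends - runs$lengths + 1L
    # Join single-character runs without spaces, all other runs with spaces
    pieces <- mapply(function(s, e, single) {
        paste(tokens[s:e], collapse = if (single) "" else " ")
    }, starts, ends, runs$values)
    paste(pieces, collapse = " ")
}

y <- c('A B C Company', 'the the A B', 'A Company B')
res <- vapply(strsplit(y, " "), merge_adjacent, character(1L))
res
# [1] "ABC Company" "the the AB"  "A Company B"
```

The trade-off is a little more bookkeeping in exchange for leaving repeated words and non-adjacent single letters untouched.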

Upvotes: 10

Avinash Raj

Reputation: 174696

You could also do this with the PCRE verbs (*SKIP)(*F):

> x <- c('A B C Company', 'XYZ Inc', 'S & K Co', 'A B C D E F G Company', ' H & K')
> gsub("\\s*\\S\\S+\\s*(*SKIP)(*F)|(?<=\\S)\\s+(?=\\S)", "", x, perl=TRUE)
[1] "ABC Company"     "XYZ Inc"         "S&K Co"          "ABCDEFG Company"
[5] " H&K"

Explanation:

  • \\s*\\S\\S+\\s* matches a run of two or more non-space characters, along with any preceding and following whitespace.
  • (*SKIP)(*F) forces that match to fail and makes the regex engine resume after the text matched so far instead of retrying inside it.
  • | starts the alternative, which is tried against the remaining text.
  • (?<=\\S)\\s+(?=\\S) matches one or more whitespace characters that are preceded by a non-space character and followed by a non-space character.
  • Replacing those matches with the empty string gives the desired output.

Note: See the last element. This regex won't remove leading spaces, because spaces at the start of the string are not preceded by a non-space character.
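
If the leading spaces should go as well, one option is to wrap the call in base R's trimws(). This is only a sketch, and it assumes that leading and trailing whitespace is never significant in your strings:

```r
x <- c('A B C Company', ' H & K')
# Same pattern as above; trimws() then strips leading and trailing whitespace
res <- trimws(gsub("\\s*\\S\\S+\\s*(*SKIP)(*F)|(?<=\\S)\\s+(?=\\S)", "", x, perl = TRUE))
res
# [1] "ABC Company" "H&K"
```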

Upvotes: 0

walid toumi

Reputation: 2272

Another option

(?![ ]+\\S\\S)[ ]+
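
For illustration, here is how this pattern could be used with gsub. It relies on a lookahead, so perl = TRUE is required. The fourth test string is an extra case added for this sketch: because the pattern only inspects what follows the spaces, it also removes the space after a longer word when a single letter comes next.

```r
x <- c('A B C Company', 'XYZ Inc', 'S & K Co', 'Company A B C')
# Remove a run of spaces unless it is followed by two or more non-space characters
res <- gsub("(?![ ]+\\S\\S)[ ]+", "", x, perl = TRUE)
res
# [1] "ABC Company" "XYZ Inc"     "S&K Co"      "CompanyABC"
```

So this works for the strings in the question, but it is probably not the right choice if single letters can appear after a longer word.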

Upvotes: 1

hwnd

Reputation: 70722

Here is one way you could do this, seeing that & is mixed in and is not a word character:

x <- c('A B C Company', 'XYZ Inc', 'S & K Co', 'A B C D E F G Company')
gsub('(?<!\\S\\S)\\s+(?=\\S(?!\\S))', '', x, perl=TRUE)
# [1] "ABC Company"     "XYZ Inc"         "S&K Co"          "ABCDEFG Company"

Explanation:

First we assert that the current position is not preceded by two consecutive non-whitespace characters. Then we match one or more whitespace characters. Finally, we look ahead to assert that a single non-whitespace character follows, that is, a non-whitespace character that is not itself followed by another non-whitespace character.

(?<!        # look behind to see if there is not:
  \S        #   non-whitespace (all but \n, \r, \t, \f, and " ")
  \S        #   non-whitespace (all but \n, \r, \t, \f, and " ")
)           # end of look-behind
\s+         # whitespace (\n, \r, \t, \f, and " ") (1 or more times)
(?=         # look ahead to see if there is:
  \S        #   non-whitespace (all but \n, \r, \t, \f, and " ")
  (?!       #   look ahead to see if there is not:
    \S      #     non-whitespace (all but \n, \r, \t, \f, and " ")
  )         #   end of look-ahead
)           # end of look-ahead

Upvotes: 19

alpha bravo

Reputation: 7948

Coming late to the game, but would this pattern work for you?

(?<!\\S\\S)\\s+(?!\\S\\S)
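
A quick way to try it out in R, as a sketch: both lookarounds need perl = TRUE, and the last test string is an extra case added here to check single letters at the end of a string.

```r
x <- c('A B C Company', 'XYZ Inc', 'S & K Co', 'Company A B C')
# Remove whitespace that is neither preceded nor followed by two non-space characters
res <- gsub("(?<!\\S\\S)\\s+(?!\\S\\S)", "", x, perl = TRUE)
res
# [1] "ABC Company" "XYZ Inc"     "S&K Co"      "Company ABC"
```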


Upvotes: 7
