Reputation: 2074
I am reformatting some character vectors, but there are a few formatting anomalies that I unexpectedly have to deal with. Here's an example of a string that will be reformatted:
t <- "COZ009 - 013 - 016 - 018 034>036 - 039>040 - 066>081"
The problem is that a hyphen is missing here "...018 034>036...". It should be "...018 - 034>036...".
I'd like to add a hyphen using a simple base function like gsub, but how do I replace the space that is missing the hyphen without also touching all the other spaces? That is, how do I make a replacement conditional on the surrounding characters?
The closest I've been able to come up with is:
t2 <- gsub(" - ", "-", t)
gsub(" ", "-", t2)
[1] "COZ009-013-016-018-034>036-039>040-066>081"
It could be there's nothing wrong with this solution, but it would be nice to know how to replace conditionally.
Upvotes: 1
Views: 916
Reputation: 269654
1) Replace any space that is surrounded by word boundaries with space, minus, space:
gsub("\\b \\b", " - ", t)
## [1] "COZ009 - 013 - 016 - 018 - 034>036 - 039>040 - 066>081"
2) Another simple approach is to replace any sequence of spaces and minus signs with space, minus, space:
gsub("[ -]+", " - ", t)
## [1] "COZ009 - 013 - 016 - 018 - 034>036 - 039>040 - 066>081"
2a) A variation of this would be to use strsplit:
sapply(strsplit(t, "[ -]+"), paste, collapse = " - ")
## [1] "COZ009 - 013 - 016 - 018 - 034>036 - 039>040 - 066>081"
3) Another possibility is to replace space, minus, space with space and then replace all spaces with space, minus, space.
tmp <- gsub(" - ", " ", t)
gsub(" ", " - ", tmp)
## [1] "COZ009 - 013 - 016 - 018 - 034>036 - 039>040 - 066>081"
4) Another simple possibility is to replace space, minus, space with some character that does not occur in the string, such as a semicolon, then replace each remaining space with space, minus, space, and finally convert the semicolons back. In this case (3) is similar but simpler; however, if you had to replace the original space with something else, this approach might be preferred to (3).
tmp <- gsub(" - ", ";", t)
tmp <- gsub(" ", " - ", tmp)
gsub(";", " - ", tmp)
## [1] "COZ009 - 013 - 016 - 018 - 034>036 - 039>040 - 066>081"
Update: Added a new (1) plus additional alternatives.
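As a rough sketch of how approach (2) behaves on a whole character vector: gsub is vectorised, so no loop is needed. The vector x below is made up for illustration and is not from the question. Note that "[ -]+" also rewrites a hyphen that has no spaces around it, which may or may not be what you want.

```r
# Hypothetical input: the second element contains a bare hyphen with no spaces.
x <- c("COZ009 - 013 - 016 - 018 034>036", "COZ001-002 003")

# Every run of spaces and/or hyphens becomes " - ", element by element.
res <- gsub("[ -]+", " - ", x)

res[1]  # value: "COZ009 - 013 - 016 - 018 - 034>036"
res[2]  # value: "COZ001 - 002 - 003" (the bare hyphen was padded too)
```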
Upvotes: 2
Reputation: 2753
The answer by @G5W works - I'd modify the code to include only certain string lengths:
gsub("([[:digit:]]{1,3})[[:space:]]{1,2}([[:digit:]]{1,3})", "\\1 - \\2", t)
[1] "COZ009 - 013 - 016 - 018 - 034>036 - 039>040 - 066>081"
The above code looks specifically for one to three digits on each side, separated by one or two whitespace characters.
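To see what the quantifiers in that pattern do and do not restrict, here is a small sketch on invented inputs (none of these strings come from the question). Because the pattern is not anchored with word boundaries, a longer run of digits can still match on its last three digits.

```r
# Pattern from the answer above; the inputs are invented for illustration.
pat <- "([[:digit:]]{1,3})[[:space:]]{1,2}([[:digit:]]{1,3})"

two_spaces   <- gsub(pat, "\\1 - \\2", "018  034")   # two spaces: replaced
three_spaces <- gsub(pat, "\\1 - \\2", "018   034")  # three spaces: no match
long_run     <- gsub(pat, "\\1 - \\2", "12345 678")  # unanchored: still matches

two_spaces    # value: "018 - 034"
three_spaces  # value: "018   034" (unchanged)
long_run      # value: "12345 - 678"
```

If the digit count really has to be enforced, adding \\b on both ends of the pattern would be one way to try it.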
Upvotes: 1
Reputation: 37641
You can specify that the surrounding characters are digits and use capture groups so that you do not remove them.
gsub("(\\d)\\s+(\\d)", "\\1 - \\2", t)
[1] "COZ009 - 013 - 016 - 018 - 034>036 - 039>040 - 066>081"
Here the parentheses around each digit store it in a capture group, referred to in the replacement as \1 and \2 (written \\1 and \\2 in an R string), so the digits are put back unchanged.
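A related option, not from the answer above, is to use lookarounds with perl = TRUE, so that the surrounding digits are tested but never consumed and there is nothing to put back. This is only a sketch; the "1 2 3" input is made up to show one case where the two approaches differ.

```r
t <- "COZ009 - 013 - 016 - 018 034>036 - 039>040 - 066>081"

# (?<=\\d) : whitespace must be preceded by a digit (digit is not consumed)
# (?=\\d)  : whitespace must be followed by a digit (digit is not consumed)
res <- gsub("(?<=\\d)\\s+(?=\\d)", " - ", t, perl = TRUE)
res  # value: "COZ009 - 013 - 016 - 018 - 034>036 - 039>040 - 066>081"

# Made-up input with single-digit fields: the capture-group version consumes
# the "2" in its first match, so the second gap is skipped.
captured   <- gsub("(\\d)\\s+(\\d)", "\\1 - \\2", "1 2 3")
lookaround <- gsub("(?<=\\d)\\s+(?=\\d)", " - ", "1 2 3", perl = TRUE)
captured    # value: "1 - 2 3"
lookaround  # value: "1 - 2 - 3"
```

For the three-digit codes in the question both versions give the same result, so the lookaround form mainly matters when fields can be a single digit.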
Upvotes: 5