Reputation: 70
I'm new to R and unable to find other threads with a similar issue.
I'm cleaning data that requires punctuation at the end of each line. I am unable to add, say, a period without overwriting the final character of the line preceding the carriage return + line feed.
Sample code:
Data1 <- "%trn: dads sheep\r\n*MOT: hunn.\r\n%trn: yes.\r\n*MOT: ana mu\r\n%trn: where is it?"
Data2 <- gsub("[^[:punct:]]\r\n\\*", ".\r\n\\*", Data1)
The contents of Data2:
[1] "%trn: dads shee.\r\n*MOT: hunn.\r\n%trn: yes.\r\n*MOT: ana mu\r\n%trn: where is it?"
Notice the "p" of sheep was overwritten with the period. Any thoughts on how I could avoid this?
Upvotes: 2
Views: 442
Reputation: 70732
Use a capturing group around your character class and reference the group inside of your replacement.
gsub('([^[:punct:]])\\r\\n\\*', '\\1.\r\n*', Data1)
# the added parentheses capture the last character of the line; \\1 writes it back before the period
# [1] "%trn: dads sheep.\r\n*MOT: hunn.\r\n%trn: yes.\r\n*MOT: ana mu\r\n%trn: where is it?"
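To see why the original pattern ate the "p", it can help to look at what the regex actually matches. The snippet below is only a sketch using the sample string from the question; the variable names `matched` and `fixed` are made up for illustration.

```r
Data1 <- "%trn: dads sheep\r\n*MOT: hunn.\r\n%trn: yes.\r\n*MOT: ana mu\r\n%trn: where is it?"

# The character class is part of the match, so gsub() replaces that
# character along with the line break and the asterisk.
matched <- regmatches(Data1, gregexpr("[^[:punct:]]\r\n\\*", Data1))[[1]]
print(matched)

# Wrapping the class in parentheses captures the character, and \\1 in the
# replacement writes it back in front of the period.
fixed <- gsub("([^[:punct:]])\r\n\\*", "\\1.\r\n*", Data1)
print(fixed)
```

Everything the pattern consumes is replaced, so anything you want to keep has to be captured and put back, or left out of the match altogether.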
You can also switch on PCRE by using perl=T and use lookarounds to achieve this.
gsub('[^\\pP]\\K(?=\\r\\n\\*)', '.', Data1, perl=T)
# [1] "%trn: dads sheep.\r\n*MOT: hunn.\r\n%trn: yes.\r\n*MOT: ana mu\r\n%trn: where is it?"
The negated class [^\pP] uses the Unicode property \pP and matches any character that is not a punctuation character of any kind.
Instead of using a capturing group, I used \K here. This escape sequence resets the starting point of the reported match, so any previously matched characters are not included in the final match. I also used a positive lookahead to assert that a carriage return + line feed sequence and a literal asterisk follow.
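One thing worth knowing if you swap between the two character classes: \pP and [:punct:] do not cover exactly the same characters. ASCII symbols such as "$" and "+" are in [:punct:], but Unicode classes them as symbols rather than punctuation, so \pP skips them. The sketch below compares the two under PCRE only; the vector `chars` and the string `x` are made-up examples, not data from the question.

```r
# A few characters that might end a line (hypothetical examples).
chars <- c(".", "?", "$", "+", "p")

# TRUE where each class treats the character as punctuation.
# Both calls use perl = TRUE so that only the class differs.
posix_punct   <- grepl("[[:punct:]]", chars, perl = TRUE)
unicode_punct <- grepl("\\pP", chars, perl = TRUE)
print(data.frame(chars, posix_punct, unicode_punct))

# Consequence for a line that ends in "$":
x <- "cost 5$\r\n*MOT: ok"
with_posix   <- gsub("([^[:punct:]])\r\n\\*", "\\1.\r\n*", x, perl = TRUE)
with_unicode <- gsub("([^\\pP])\r\n\\*", "\\1.\r\n*", x, perl = TRUE)
print(with_posix)    # "$" counts as punctuation here, so nothing is added
print(with_unicode)  # "$" is not \pP punctuation, so a period is added
```

Which behaviour you want depends on whether a trailing symbol should count as "already punctuated" in your data.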
Upvotes: 2
Reputation: 5271
There are several ways to do it:
Capture group:
gsub("([^[:punct:]])\\r\\n\\*", "\\1.\r\n*", Data1)
Positive lookbehind (a zero-width assertion, so the preceding character is checked but not consumed; needs perl=T):
gsub("(?<=[^[:punct:]])\\r\\n\\*", ".\r\n*", Data1, perl=T)
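As a rough check of what the lookbehind does, you can inspect the match itself. The preceding character is tested but is not part of the match, so there is nothing to put back. This is a sketch on the question's sample string; `m` and `result` are illustrative names.

```r
Data1 <- "%trn: dads sheep\r\n*MOT: hunn.\r\n%trn: yes.\r\n*MOT: ana mu\r\n%trn: where is it?"

# Lookbehind is a PCRE feature, hence perl = TRUE.
# The match is only the line break plus the asterisk; the "p" is not in it.
m <- regmatches(Data1, gregexpr("(?<=[^[:punct:]])\r\n\\*", Data1, perl = TRUE))[[1]]
print(m)

# The replacement therefore only has to supply the period, the line break
# and the asterisk.
result <- gsub("(?<=[^[:punct:]])\\r\\n\\*", ".\r\n*", Data1, perl = TRUE)
print(result)
```

Compared with the capture group, this keeps the replacement free of backreferences, at the cost of requiring the PCRE engine.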
EDIT: fixed the backslashes and removed the uncertainty about R support for these.
Upvotes: 1