spearmint

Reputation: 70

Cleaning strings in R: add punctuation w/o overwriting last character

I'm new to R and unable to find other threads with a similar issue.

I'm cleaning data in which every line must end with punctuation. When I try to add, say, a period, it overwrites the final character of the line, i.e. the one just before the carriage return + line feed.

Sample code:

Data1 <- "%trn: dads sheep\r\n*MOT: hunn.\r\n%trn: yes.\r\n*MOT: ana mu\r\n%trn: where is it?"
Data2 <- gsub("[^[:punct:]]\r\n\\*", ".\r\n\\*", Data1)

The contents of Data2:

[1] "%trn: dads shee.\r\n*MOT: hunn.\r\n%trn: yes.\r\n*MOT: ana mu\r\n%trn: where is it?"

Notice the "p" of sheep was overwritten with the period. Any thoughts on how I could avoid this?

Upvotes: 2

Views: 442

Answers (2)

hwnd

Reputation: 70732

Capturing group:

Use a capturing group around your character class and reference the group inside of your replacement.

gsub('([^[:punct:]])\\r\\n\\*', '\\1.\r\n*', Data1)
      ^            ^             ^^^
# [1] "%trn: dads sheep.\r\n*MOT: hunn.\r\n%trn: yes.\r\n*MOT: ana mu\r\n%trn: where is it?"
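To make the difference concrete, here is a minimal sketch that runs the question's original pattern and the capturing-group version side by side on the sample Data1. The helper first_line is hypothetical and only there to keep the printed output short.

```r
# Sample data from the question
Data1 <- "%trn: dads sheep\r\n*MOT: hunn.\r\n%trn: yes.\r\n*MOT: ana mu\r\n%trn: where is it?"

# Original attempt: the non-punctuation character is part of the match,
# so it is replaced together with the line break.
overwritten <- gsub("[^[:punct:]]\r\n\\*", ".\r\n*", Data1)

# Capturing group: \\1 puts the matched character back before the period.
preserved <- gsub("([^[:punct:]])\r\n\\*", "\\1.\r\n*", Data1)

# Hypothetical helper: return only the first CRLF-delimited line
first_line <- function(x) strsplit(x, "\r\n", fixed = TRUE)[[1]][1]

first_line(overwritten)  # "%trn: dads shee."
first_line(preserved)    # "%trn: dads sheep."
```

Everything the pattern matches is replaced, so whatever should survive has to be either captured and re-emitted, or kept out of the match in the first place.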

Lookarounds:

You can switch on PCRE with perl=TRUE and use lookarounds to achieve this.

gsub('[^\\pP]\\K(?=\\r\\n\\*)', '.', Data1, perl=TRUE)
# [1] "%trn: dads sheep.\r\n*MOT: hunn.\r\n%trn: yes.\r\n*MOT: ana mu\r\n%trn: where is it?"

The negated class [^\pP] matches any character that is not a Unicode punctuation character.

Instead of a capturing group, I used \K here. This escape sequence resets the starting point of the reported match, so any characters matched before it are excluded from the final match. I also used a positive lookahead to assert that a carriage return + line feed sequence and a literal asterisk follow.
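As a rough illustration of what "zero-width" means here, the sketch below inspects the match itself rather than the replacement. It assumes the sample Data1 from the question; the variable names m and with_period are only for this example.

```r
# Sample data from the question
Data1 <- "%trn: dads sheep\r\n*MOT: hunn.\r\n%trn: yes.\r\n*MOT: ana mu\r\n%trn: where is it?"

# Locate every match of the \K + lookahead pattern (PCRE, so perl = TRUE)
m <- gregexpr("[^\\pP]\\K(?=\\r\\n\\*)", Data1, perl = TRUE)[[1]]

as.integer(m)            # 17: the reported match starts just after "sheep"
attr(m, "match.length")  # 0: the reported match is empty

# Replacing an empty match inserts text without removing anything
with_period <- sub("[^\\pP]\\K(?=\\r\\n\\*)", ".", Data1, perl = TRUE)
substr(with_period, 12, 17)  # "sheep."
```

Because the reported match has length zero, the replacement is inserted at that position and neither the "p" nor the line break is touched.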

Upvotes: 2

Brian Stephens

Reputation: 5271

There are several ways to do it:

Capture group: gsub("([^[:punct:]])\\r\\n\\*", "\\1.\r\n*", Data1)

Positive lookbehind (a zero-width assertion, so the preceding character is not part of the match): gsub("(?<=[^[:punct:]])\\r\\n\\*", ".\r\n*", Data1, perl=TRUE)

EDIT: fixed the backslashes and removed the uncertainty about R support for these.
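As a quick sanity check, the sketch below runs the capture-group, lookbehind, and \K variants on the question's sample data and confirms they agree. The variable names are made up for this example. One caveat: [[:punct:]] and \pP are not the same set (ASCII symbols such as $ or + count as [[:punct:]] but not as \pP), so the variants could differ on other input.

```r
# Sample data from the question
Data1 <- "%trn: dads sheep\r\n*MOT: hunn.\r\n%trn: yes.\r\n*MOT: ana mu\r\n%trn: where is it?"

# 1. Capture group: re-emit the matched character with \\1
by_group <- gsub("([^[:punct:]])\\r\\n\\*", "\\1.\r\n*", Data1)

# 2. Positive lookbehind: the character is checked but never consumed
by_lookbehind <- gsub("(?<=[^[:punct:]])\\r\\n\\*", ".\r\n*", Data1, perl = TRUE)

# 3. \K plus lookahead: the reported match is empty, so "." is inserted
by_reset <- gsub("[^\\pP]\\K(?=\\r\\n\\*)", ".", Data1, perl = TRUE)

identical(by_group, by_lookbehind)  # TRUE
identical(by_group, by_reset)       # TRUE

# Print with real line breaks to inspect the result
cat(by_group, "\n")
```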

Upvotes: 1
