paulie.jvenuez

Reputation: 295

R string removes punctuation on split

Say I have a string such as the following:

x <- 'The world is at end. What do you think?   I am going crazy!    These people are too calm.'

I need to split only on the punctuation characters !, ? and . together with any whitespace that follows, while keeping the punctuation attached to each piece.

The following removes the punctuation and leaves leading spaces in the split parts, though:

vec <- strsplit(x, '[!?.][:space:]*')

How can I split sentences leaving the punctuation?

Upvotes: 8

Views: 3520

Answers (5)

hwnd

Reputation: 70732

You can switch on PCRE by using perl=TRUE and use a lookbehind assertion.

strsplit(x, '(?<![^!?.])\\s+', perl=TRUE)

Regular expression:

(?<!          look behind to see if there is not:
 [^!?.]       any character except: '!', '?', '.'
)             end of look-behind
\s+           whitespace (\n, \r, \t, \f, and " ") (1 or more times)
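To check the pattern against the string from the question, here is a minimal sketch using base R only. The variable name parts is just for illustration.

```r
x <- 'The world is at end. What do you think?   I am going crazy!    These people are too calm.'

# Split on runs of whitespace that directly follow '!', '?' or '.'.
# The double negative "(?<![^!?.])" also holds at the very start of the
# string, where there is no preceding character at all.
parts <- unlist(strsplit(x, '(?<![^!?.])\\s+', perl = TRUE))
print(parts)
```

strsplit returns a list with one element per input string, so unlist (or [[1]]) is needed to get a plain character vector of sentences.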


Upvotes: 14

Tyler Rinker

Reputation: 109964

As of qdap version 1.1.0 you can use the sent_detect function as follows:

library(qdap)
sent_detect(x)

## [1] "The world is at end."       "What do you think?"        
## [3] "I am going crazy!"          "These people are too calm."

Upvotes: 1

ndr

Reputation: 1437

You could replace the whitespace following punctuation marks with a placeholder string, e.g. zzzzz, and then split on that string.

x <- gsub("([!?.])[[:space:]]*","\\1zzzzz","The world is at end. What do you think?   I am going crazy!    These people are too calm.")
strsplit(x, "zzzzz")

Where \1 in the replacement string refers to the parenthesized sub-expression of the pattern.
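The placeholder approach breaks if the placeholder happens to occur in the text, and a placeholder containing regex metacharacters would be misread by strsplit. A minimal sketch that guards against the second problem, assuming a hypothetical marker string `<<SPLIT>>` that never appears in the input:

```r
x <- 'The world is at end. What do you think?   I am going crazy!    These people are too calm.'

# Hypothetical placeholder; any token that cannot occur in the text works.
marker <- "<<SPLIT>>"

# Keep the punctuation (\\1) and swap the trailing whitespace for the marker.
marked <- gsub("([!?.])[[:space:]]*", paste0("\\1", marker), x)

# fixed = TRUE treats the marker as a literal string, not as a regex.
parts <- strsplit(marked, marker, fixed = TRUE)[[1]]
print(parts)
```

The marker left after the final full stop does not produce an empty trailing element, because strsplit ignores a match at the very end of the string.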

Upvotes: 1

Blue Magister

Reputation: 13363

Take a look at this question. Character classes like [:space:] are only valid inside a bracket expression, so you need to enclose them in another set of brackets. Try:

vec <- strsplit(x, '[!?.][[:space:]]*')
vec
# [[1]]
# [1] "The world is at end"       "What do you think"        
# [3] "I am going crazy"          "These people are too calm"

This gets rid of the leading spaces. To keep punctuation, use a positive lookbehind assertion with perl = TRUE:

vec <- strsplit(x, '(?<=[!?.])[[:space:]]*', perl = TRUE)
vec
# [[1]]
# [1] "The world is at end."       "What do you think?"        
# [3] "I am going crazy!"          "These people are too calm."
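One thing to watch with the lookbehind version: because [[:space:]]* can match zero characters, the split also fires after punctuation that is not followed by whitespace, such as a decimal point. A small sketch of the difference, using a made-up input string y that is not from the question:

```r
y <- "Pi is 3.14. Done!"   # hypothetical input, not from the question

# '*' lets the split fire with zero whitespace, so "3.14" is cut in two.
star <- strsplit(y, '(?<=[!?.])[[:space:]]*', perl = TRUE)[[1]]

# '+' requires at least one whitespace character after the punctuation.
plus <- strsplit(y, '(?<=[!?.])[[:space:]]+', perl = TRUE)[[1]]

print(star)
print(plus)
```

If the text may contain numbers or abbreviations, requiring at least one whitespace character with + is probably the safer choice; for the string in the question both quantifiers should give the same four sentences.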

Upvotes: 2

Tyler Rinker

Reputation: 109964

The sentSplit function in the qdap package was created just for this task:

library(qdap)
sentSplit(data.frame(text = x), "text")

##   tot                       text
## 1 1.1       The world is at end.
## 2 2.2         What do you think?
## 3 3.3          I am going crazy!
## 4 4.4 These people are too calm.

Upvotes: 6
