Reputation: 295
Say I have a string, for example the following:
x <- 'The world is at end. What do you think? I am going crazy! These people are too calm.'
I need to split only on the punctuation !?. and any whitespace that follows it, keeping the punctuation with each piece.
This attempt removes the punctuation and leaves leading spaces in the split parts, though:
vec <- strsplit(x, '[!?.][:space:]*')
How can I split the string into sentences while keeping the punctuation?
Upvotes: 8
Views: 3520
Reputation: 70732
You can switch on PCRE by using perl=TRUE and use a lookbehind assertion.
strsplit(x, '(?<![^!?.])\\s+', perl=TRUE)
Regular expression:
(?<! look behind to see if there is not:
[^!?.] any character except: '!', '?', '.'
) end of look-behind
\s+ whitespace (\n, \r, \t, \f, and " ") (1 or more times)
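To check what the pattern does, here is a minimal sketch in R, assuming x is the example string from the question. The variable name parts is only illustrative.

```r
# Example string from the question
x <- 'The world is at end. What do you think? I am going crazy! These people are too calm.'

# Split on whitespace that is NOT preceded by a non-punctuation character,
# i.e. whitespace that directly follows '!', '?' or '.'
parts <- strsplit(x, '(?<![^!?.])\\s+', perl = TRUE)[[1]]

print(parts)
```

Because the lookbehind consumes no characters, only the whitespace is removed and the punctuation stays attached to each sentence.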
Upvotes: 14
Reputation: 109964
As of qdap version 1.1.0 you can use the sent_detect function as follows:
library(qdap)
sent_detect(x)
## [1] "The world is at end." "What do you think?"
## [3] "I am going crazy!" "These people are too calm."
Upvotes: 1
Reputation: 1437
You could replace the spaces following punctuation marks with a placeholder string, e.g. zzzzz, and then split on that string.
x <- gsub("([!?.])[[:space:]]*","\\1zzzzz","The world is at end. What do you think? I am going crazy! These people are too calm.")
strsplit(x, "zzzzz")
Here \1 in the replacement string refers to the parenthesized sub-expression of the pattern, i.e. the punctuation mark that was matched, so it is written back before the placeholder.
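One risk with this approach is that the placeholder might already occur in the text. The sketch below is a variation, not the method above verbatim: it uses a control character as the placeholder and requires at least one whitespace character after the punctuation. The function name split_on_placeholder and the choice of "\001" are assumptions made for illustration.

```r
# Hypothetical helper: mark sentence boundaries with a placeholder, then split.
# "\001" (a control character) is assumed not to occur in ordinary text.
split_on_placeholder <- function(text, placeholder = "\001") {
  # Keep the punctuation (\\1) and replace the whitespace after it
  marked <- gsub("([!?.])[[:space:]]+", paste0("\\1", placeholder), text)
  # fixed = TRUE treats the placeholder literally rather than as a regex
  strsplit(marked, placeholder, fixed = TRUE)[[1]]
}

x <- 'The world is at end. What do you think? I am going crazy! These people are too calm.'
sentences <- split_on_placeholder(x)
print(sentences)
```

Splitting with fixed = TRUE avoids surprises if a different placeholder containing regex metacharacters is passed in.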
Upvotes: 1
Reputation: 13363
Take a look at this question. Character classes like [:space:] are only valid inside a bracket expression, so you need to enclose the class in another set of brackets. Try:
vec <- strsplit(x, '[!?.][[:space:]]*')
vec
# [[1]]
# [1] "The world is at end" "What do you think"
# [3] "I am going crazy" "These people are too calm"
This gets rid of the leading spaces. To keep the punctuation, use a positive lookbehind assertion with perl = TRUE:
vec <- strsplit(x, '(?<=[!?.])[[:space:]]*', perl = TRUE)
vec
# [[1]]
# [1] "The world is at end." "What do you think?"
# [3] "I am going crazy!" "These people are too calm."
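Since strsplit is vectorised over its first argument, the same pattern can be applied to several strings at once. The following is a small sketch with made-up input; the names docs and sents are hypothetical.

```r
# Two made-up example strings (not from the question)
docs <- c("Hello there. How are you?", "Fine! Thanks.")

# Positive lookbehind: split at whitespace that follows '!', '?' or '.'
sents <- strsplit(docs, '(?<=[!?.])[[:space:]]*', perl = TRUE)

# One character vector of sentences per input string
print(sents)

# Flatten into a single vector if the grouping is not needed
all_sents <- unlist(sents)
```

The result is a list with one element per input string, so unlist is only needed when a single flat vector of sentences is wanted.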
Upvotes: 2
Reputation: 109964
The sentSplit function in the qdap package was created just for this task:
library(qdap)
sentSplit(data.frame(text = x), "text")
## tot text
## 1 1.1 The world is at end.
## 2 2.2 What do you think?
## 3 3.3 I am going crazy!
## 4 4.4 These people are too calm.
Upvotes: 6