Reputation: 6277
I'm using the excellent tidytext package to tokenize sentences in several paragraphs. For instance, I want to take the following paragraph:
"I am perfectly convinced by it that Mr. Darcy has no defect. He owns it himself without disguise."
and tokenize it into its two sentences. However, when I use the default sentence tokenizer of tidytext, I get three sentences.
Code
library(dplyr)
library(tidytext)

df <- data_frame(Example_Text = c("I am perfectly convinced by it that Mr. Darcy has no defect. He owns it himself without disguise."))
unnest_tokens(df, input = "Example_Text", output = "Sentence", token = "sentences")
Result
# A tibble: 3 x 1
Sentence
<chr>
1 i am perfectly convinced by it that mr.
2 darcy has no defect.
3 he owns it himself without disguise.
What is a simple way to use tidytext to tokenize sentences without running into issues with common abbreviations such as "Mr." or "Dr." being interpreted as sentence endings?
Upvotes: 4
Views: 4123
Reputation: 1482
Both corpus and quanteda have special handling for abbreviations when determining sentence boundaries. Here's how to split sentences with corpus:
library(dplyr)
library(corpus)
df <- data_frame(Example_Text = c("I am perfectly convinced by it that Mr. Darcy has no defect. He owns it himself without disguise."))
text_split(df$Example_Text, "sentences")
## parent index text
## 1 1 1 I am perfectly convinced by it that Mr. Darcy has no defect.
## 2 1 2 He owns it himself without disguise.
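If you prefer quanteda, a similar sentence split can be done by reshaping a corpus object to the sentence level. A minimal sketch (assuming a current quanteda version, where `corpus_reshape()` is the function for this):

```r
library(quanteda)

corp <- corpus("I am perfectly convinced by it that Mr. Darcy has no defect. He owns it himself without disguise.")

# Reshape the corpus so each document is one sentence;
# quanteda's sentence segmenter also knows common abbreviations like "Mr."
sents <- corpus_reshape(corp, to = "sentences")
as.character(sents)
```

This should give the same two sentences as the corpus example above.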
If you want to stick with unnest_tokens
, but want a more exhaustive list of English abbreviations, you can follow @useR's advice but use the corpus abbreviation list (most of which were taken from the Common Locale Data Repository):
abbreviations_en
## [1] "A." "A.D." "a.m." "A.M." "A.S." "AA."
## [7] "AB." "Abs." "AD." "Adj." "Adv." "Alt."
## [13] "Approx." "Apr." "Aug." "B." "B.V." "C."
## [19] "C.F." "C.O.D." "Capt." "Card." "cf." "Col."
## [25] "Comm." "Conn." "Cont." "D." "D.A." "D.C."
## (etc., 155 total)
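Putting the two ideas together, here is a hedged sketch of building @useR's lookbehind regex from the full corpus list instead of a hand-written vector of titles. The trailing period is stripped from each abbreviation (the regex matches the period itself) and internal periods are escaped; note that very short entries like "A." can cause the occasional false negative at a real sentence boundary:

```r
library(dplyr)
library(corpus)
library(tidytext)

df <- data_frame(Example_Text = "I am perfectly convinced by it that Mr. Darcy has no defect. He owns it himself without disguise.")

# Drop the trailing period and escape internal periods, e.g. "A.D." -> "A\\.D"
abbr <- sub("\\.$", "", abbreviations_en)
abbr <- gsub(".", "\\.", abbr, fixed = TRUE)

# Negative lookbehind: don't split on a period preceded by an abbreviation
pattern <- paste0("(?<!\\b(", paste(abbr, collapse = "|"), "))\\.")
unnest_tokens(df, input = "Example_Text", output = "Sentence",
              token = "regex", pattern = pattern)
```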
Upvotes: 6
Reputation: 18681
You can use a regex as the splitting condition, but there is no guarantee that this would cover all common honorifics:
unnest_tokens(df, input = "Example_Text", output = "Sentence", token = "regex",
pattern = "(?<!\\b\\p{L}r)\\.")
Result:
# A tibble: 2 x 1
Sentence
<chr>
1 i am perfectly convinced by it that mr. darcy has no defect
2 he owns it himself without disguise
You can of course always create your own list of common titles, and create a regex based on that list:
titles = c("Mr", "Dr", "Mrs", "Ms", "Sr", "Jr")
regex = paste0("(?<!\\b(", paste(titles, collapse = "|"), "))\\.")
# > regex
# [1] "(?<!\\b(Mr|Dr|Mrs|Ms|Sr|Jr))\\."
unnest_tokens(df, input = "Example_Text", output = "Sentence", token = "regex",
pattern = regex)
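One caveat with the unnest_tokens() approaches: the output is lowercased by default. If you want to keep the original casing, pass the existing `to_lower = FALSE` argument; a self-contained sketch:

```r
library(dplyr)
library(tidytext)

df <- data_frame(Example_Text = "I am perfectly convinced by it that Mr. Darcy has no defect. He owns it himself without disguise.")

titles <- c("Mr", "Dr", "Mrs", "Ms", "Sr", "Jr")
regex <- paste0("(?<!\\b(", paste(titles, collapse = "|"), "))\\.")

# to_lower = FALSE keeps "Mr. Darcy" capitalized in the output
unnest_tokens(df, input = "Example_Text", output = "Sentence",
              token = "regex", pattern = regex, to_lower = FALSE)
```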
Upvotes: 7