bschneidr

Reputation: 6277

Tokenizing sentences with unnest_tokens(), ignoring abbreviations

I'm using the excellent tidytext package to tokenize sentences in several paragraphs. For instance, I want to take the following paragraph:

"I am perfectly convinced by it that Mr. Darcy has no defect. He owns it himself without disguise."

and tokenize it into the two sentences

  1. "I am perfectly convinced by it that Mr. Darcy has no defect."
  2. "He owns it himself without disguise."

However, when I use tidytext's default sentence tokenizer, I get three sentences.

Code

library(dplyr)
library(tidytext)

df <- data_frame(Example_Text = c("I am perfectly convinced by it that Mr. Darcy has no defect. He owns it himself without disguise."))


unnest_tokens(df, input = "Example_Text", output = "Sentence", token = "sentences")

Result

# A tibble: 3 x 1
                              Sentence
                                <chr>
1 i am perfectly convinced by it that mr.
2                    darcy has no defect.
3    he owns it himself without disguise.

What is a simple way to use tidytext to tokenize sentences but without running into issues with common abbreviations such as "Mr." or "Dr." being interpreted as sentence endings?

Upvotes: 4

Views: 4123

Answers (2)

Patrick Perry

Reputation: 1482

Both corpus and quanteda have special handling for abbreviations when determining sentence boundaries. Here's how to split sentences with corpus:

library(dplyr)
library(corpus)
df <- data_frame(Example_Text = c("I am perfectly convinced by it that Mr. Darcy has no defect. He owns it himself without disguise."))

text_split(df$Example_Text, "sentences")
##   parent index text                                                         
## 1 1          1 I am perfectly convinced by it that Mr. Darcy has no defect. 
## 2 1          2 He owns it himself without disguise.
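For quanteda, a minimal sketch (assuming its corpus_reshape() function, which reshapes a corpus into sentence-level documents) would be:

library(quanteda)

corp <- corpus("I am perfectly convinced by it that Mr. Darcy has no defect. He owns it himself without disguise.")
# Reshape the single-document corpus into one document per sentence
corpus_reshape(corp, to = "sentences")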

If you want to stick with unnest_tokens, but want a more exhaustive list of English abbreviations, you can follow @useR's advice but use the corpus abbreviation list (most of which were taken from the Common Locale Data Repository):

abbreviations_en
##  [1] "A."       "A.D."     "a.m."     "A.M."     "A.S."     "AA."       
##  [7] "AB."      "Abs."     "AD."      "Adj."     "Adv."     "Alt."    
## [13] "Approx."  "Apr."     "Aug."     "B."       "B.V."     "C."      
## [19] "C.F."     "C.O.D."   "Capt."    "Card."    "cf."      "Col."    
## [25] "Comm."    "Conn."    "Cont."    "D."       "D.A."     "D.C."    
## (etc., 155 total)
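As a minimal sketch (assuming the stringi/ICU regex engine behind tidytext's "regex" tokenizer accepts a lookbehind with this many alternatives), you could combine that list with the lookbehind approach from the other answer, escaping the periods first:

library(corpus)
library(tidytext)

# Drop each abbreviation's trailing period and escape any internal ones,
# e.g. "A.D." -> "A\.D", so each entry can sit inside a lookbehind
abbr <- sub("\\.$", "", abbreviations_en)
abbr <- gsub(".", "\\.", abbr, fixed = TRUE)
pattern <- paste0("(?<!\\b(", paste(abbr, collapse = "|"), "))\\.")

unnest_tokens(df, input = "Example_Text", output = "Sentence",
              token = "regex", pattern = pattern)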

Upvotes: 6

acylam

Reputation: 18681

You can use a regex as the splitting condition, but there is no guarantee that it will cover all common honorifics:

unnest_tokens(df, input = "Example_Text", output = "Sentence", token = "regex",
              pattern = "(?<!\\b\\p{L}r)\\.")

Result:

# A tibble: 2 x 1
                                                     Sentence
                                                        <chr>
1 i am perfectly convinced by it that mr. darcy has no defect
2                         he owns it himself without disguise

You can, of course, always create your own list of common titles and build a regex from it:

titles <- c("Mr", "Dr", "Mrs", "Ms", "Sr", "Jr")
regex <- paste0("(?<!\\b(", paste(titles, collapse = "|"), "))\\.")
# > regex
# [1] "(?<!\\b(Mr|Dr|Mrs|Ms|Sr|Jr))\\."

unnest_tokens(df, input = "Example_Text", output = "Sentence", token = "regex",
              pattern = regex)

Upvotes: 7
