Reputation: 4232
I'm following a tutorial in a book and using the following code to split text into sentences:
def sentences
  gsub(/\n|\r/, ' ').split(/\.\s*/)
end
It works, but not if there's a newline that isn't preceded by a period, for example:
Hello. two line sentence
and heres the new line
In that case a "\t" appears where the new line begins. So if I call the method on the text above, I get:
["Hello.", "two line sentence \tand heres the new line"]
Any help would be much appreciated! Thanks!
Upvotes: 1
Views: 1898
Reputation: 5617
Splitting text into sentences is best handled by a dedicated NLP library such as Stanford CoreNLP. The method in the question splits on every period, so it would also break after abbreviations and name prefixes such as "Mr.".
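To see the abbreviation problem concretely, here is a small sketch that applies the question's regex to a sentence containing "Mr.". The name naive_sentences is a hypothetical stand-in for the question's method, rewritten to take the text as an argument so it runs on its own:

```ruby
# Hypothetical stand-in for the question's method (takes the text as an argument).
def naive_sentences(text)
  # Replace line breaks with spaces, then split on a period plus any trailing whitespace.
  text.gsub(/\n|\r/, ' ').split(/\.\s*/)
end

p naive_sentences("Mr. Smith went home. He slept.")
# => ["Mr", "Smith went home", "He slept"]
```

The prefix "Mr" ends up as its own "sentence", and the periods are dropped because they are part of the split pattern.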
The stanford-core-nlp Ruby gem provides a Ruby interface to it. See the instructions for installing the gem and Stanford CoreNLP in this answer; then you could write something like this:
require "stanford-core-nlp"

StanfordCoreNLP.use :english
StanfordCoreNLP.model_files = {}
# Jars to load; the version numbers must match the installed CoreNLP release
StanfordCoreNLP.default_jars = [
  'joda-time.jar',
  'xom.jar',
  'stanford-corenlp-3.5.0.jar',
  'stanford-corenlp-3.5.0-models.jar',
  'jollyday.jar',
  'bridge.jar'
]

# Build a pipeline with only the tokenizer and the sentence splitter
pipeline = StanfordCoreNLP.load(:tokenize, :ssplit)

text = 'Hello. two line sentence
and heres the new line'

# Wrap the text in an Annotation, run the pipeline, and print each sentence
annotation = StanfordCoreNLP::Annotation.new(text)
pipeline.annotate(annotation)
annotation.get(:sentences).each { |s| puts "sentence: " + s.to_s }

# output:
# sentence: Hello.
# sentence: two line sentence
# and heres the new line
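If pulling in CoreNLP and its Java dependency is more than a project needs, the stray "\t" from the question can probably be handled in plain Ruby. The question's gsub only replaces "\n" and "\r", so a tab in the output most likely comes from indentation already present in the input text. A minimal sketch, assuming that is the cause; split_sentences is a hypothetical helper name, not something from the book or the gem:

```ruby
# Hypothetical helper: assumes the tab is indentation already present in the input.
def split_sentences(text)
  text
    .gsub(/\s+/, ' ')     # collapse newlines, tabs and repeated spaces into one space
    .strip                # drop leading and trailing whitespace
    .split(/(?<=\.)\s+/)  # split on whitespace that follows a period, keeping the period
end

text = "Hello. two line sentence\n\tand heres the new line"
p split_sentences(text)
# => ["Hello.", "two line sentence and heres the new line"]
```

This still splits after abbreviations such as "Mr.", so it only suits simple text; the CoreNLP approach above is the safer choice for anything else.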
Upvotes: 3