Reputation: 191
I'm trying to split a string into sentences using the Stanford NLP parser. I used the sample code provided by Stanford NLP, but it gave me words instead of sentences.
Here's the sample input:
"this is sample input. I want to split this text into a list of sentences. Please help"
Here's my desired output:
["this is sample input.", "I want to split this text into a list of sentences.", "Please help"]
What I've done:
I heard there's an NLTK parser that uses the stanfordnlp library, but I was unable to find any sample guide for it.
At this point I'm quite confused, as there's almost no exhaustive Python guide for Stanford NLP. Using Python is mandatory for this task, since the other components in my research process the data in Python. Please help! Thank you.
Sample code:
import stanfordnlp

a = "this is sample input. I want to split this text into a list of sentences. Please help"  # in my real run, a holds a much longer text

nlp = stanfordnlp.Pipeline(processors='tokenize', lang='en')
doc = nlp(a)

for i, sentence in enumerate(doc.sentences):
    print(f"====== Sentence {i+1} tokens =======")
    print(*[f"index: {token.index.rjust(3)}\ttoken: {token.text}" for token in sentence.tokens], sep='\n')

print(doc.sentences[0].tokens[2].text)  # text of one token from the first sentence
output (from my real input text, which is much longer than the sample above):
====== Sentence 84 tokens =======
index: 1 token: Retweet
index: 2 token: 10
index: 3 token: Like
index: 4 token: 83
index: 5 token: End
index: 6 token: of
index: 7 token: conversation
index: 8 token: ©
index: 9 token: 2019
index: 10 token: Twitter
index: 11 token: About
index: 12 token: Help
index: 13 token: Center
index: 14 token: Terms
index: 15 token: Privacy
index: 16 token: policy
====== Sentence 85 tokens =======
index: 1 token: Cookies
index: 2 token: Ads
index: 3 token: info
Source: https://stanfordnlp.github.io/stanfordnlp/pipeline.html
Upvotes: 1
Views: 1523
Reputation: 143097
I would normally use a plain split('.'), but it will not work if a sentence ends with '?' or '!', etc. A regex would handle those, but it may still treat a '...' inside a sentence as the end of three sentences, as the quick check below shows.
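For example, a quick sketch using only the standard re module (the text variable and the pattern are just illustrations, not a recommendation):

import re

text = "this is ... sample input. I want to split this text into a list of sentences. Can you? Please help"

# naive split on '.' misses sentences ending with '?' or '!'
print(text.split('.'))

# a character-class regex also splits on '?' and '!',
# but the '...' produces empty pieces, i.e. it is treated as three sentence ends
print(re.split(r'[.?!]', text))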
With stanfordnlp I can only concatenate the words of each sentence, which gives every sentence as one string, but this simple method adds spaces before ',', '.', '?', '!', etc.:
import stanfordnlp

text = "this is ... sample input. I want to split this text into a list of sentences. Can you? Please help"

nlp = stanfordnlp.Pipeline(processors='tokenize', lang='en')
doc = nlp(text)

for i, sentence in enumerate(doc.sentences):
    sent = ' '.join(word.text for word in sentence.words)
    print(sent)
Result
this is ... sample input .
I want to split this text into a list of sentences .
Can you ?
Please help
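If the extra spaces before punctuation are a problem, a rough post-processing step could strip them. This is only a sketch: the regex covers the listed ASCII punctuation, and it will also glue a '...' to the previous word.

import re

joined = [
    'this is ... sample input .',
    'I want to split this text into a list of sentences .',
    'Can you ?',
    'Please help',
]

# remove the space that ' '.join() put before . , ? ! ; :
cleaned = [re.sub(r'\s+([.,?!;:])', r'\1', s) for s in joined]
print(cleaned)
# ['this is... sample input.', 'I want to split this text into a list of sentences.', 'Can you?', 'Please help']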
Maybe you could find in the stanfordnlp source code how it splits the text into sentences and use that part directly.
Upvotes: 1