Reputation: 191
I'm trying to split a string into sentences using the Stanford NLP parser. I used the sample code provided by Stanford NLP, but it gave me words instead of sentences.
Here's the sample input:
"this is sample input. I want to split this text into a list of sentences. Please help"
Here's my desired output:
["this is sample input.", "I want to split this text into a list of sentences.", "Please help"]
What I've done:
I heard there's an NLTK parser that uses the stanfordnlp library, but I was unable to find any sample guide for it.
At this point I'm quite confused, as there's almost no exhaustive Python guide for Stanford NLP. Using Python is mandatory for this task, since the other components in my research process the data in Python. Please help! Thank you.
Sample code:
import stanfordnlp

a = "this is sample input. I want to split this text into a list of sentences. Please help"  # in my real run, a holds a much longer text

nlp = stanfordnlp.Pipeline(processors='tokenize', lang='en')
doc = nlp(a)

for i, sentence in enumerate(doc.sentences):
    print(f"====== Sentence {i+1} tokens =======")
    print(*[f"index: {token.index.rjust(3)}\ttoken: {token.text}" for token in sentence.tokens], sep='\n')

print(doc.sentences[0].tokens[2].text)  # text of one token from the first sentence
output (from my real input text, which is much longer than the sample above):
====== Sentence 84 tokens =======
index: 1 token: Retweet
index: 2 token: 10
index: 3 token: Like
index: 4 token: 83
index: 5 token: End
index: 6 token: of
index: 7 token: conversation
index: 8 token: ©
index: 9 token: 2019
index: 10 token: Twitter
index: 11 token: About
index: 12 token: Help
index: 13 token: Center
index: 14 token: Terms
index: 15 token: Privacy
index: 16 token: policy
====== Sentence 85 tokens =======
index: 1 token: Cookies
index: 2 token: Ads
index: 3 token: info
Source: https://stanfordnlp.github.io/stanfordnlp/pipeline.html
Upvotes: 1
Views: 1523
Reputation: 143097
I would normally use a plain split('.'), but it will not work if a sentence ends with '?' or '!', etc. A regex would handle those, but it may still treat a '...' inside a sentence as the end of three sentences, as the quick check below shows.
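For example, a quick sketch using only the standard re module (the text variable and the pattern are just illustrations, not a recommendation):

import re

text = "this is ... sample input. I want to split this text into a list of sentences. Can you? Please help"

# naive split on '.' misses sentences ending with '?' or '!'
print(text.split('.'))

# a character-class regex also splits on '?' and '!',
# but the '...' produces empty pieces, i.e. it is treated as three sentence ends
print(re.split(r'[.?!]', text))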
With stanfordnlp I can only concatenate the words of each sentence, which gives every sentence as one string, but this simple method adds spaces before ',', '.', '?', '!', etc.:
import stanfordnlp

text = "this is ... sample input. I want to split this text into a list of sentences. Can you? Please help"

nlp = stanfordnlp.Pipeline(processors='tokenize', lang='en')
doc = nlp(text)

for i, sentence in enumerate(doc.sentences):
    sent = ' '.join(word.text for word in sentence.words)
    print(sent)
Result
this is ... sample input .
I want to split this text into a list of sentences .
Can you ?
Please help
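If the extra spaces before punctuation are a problem, a rough post-processing step could strip them. This is only a sketch: the regex covers the listed ASCII punctuation, and it will also glue a '...' to the previous word.

import re

joined = [
    'this is ... sample input .',
    'I want to split this text into a list of sentences .',
    'Can you ?',
    'Please help',
]

# remove the space that ' '.join() put before . , ? ! ; :
cleaned = [re.sub(r'\s+([.,?!;:])', r'\1', s) for s in joined]
print(cleaned)
# ['this is... sample input.', 'I want to split this text into a list of sentences.', 'Can you?', 'Please help']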
Maybe you could find in the stanfordnlp source code how it splits the text into sentences and use that part directly.
Upvotes: 1