Reputation: 97
I really cannot see what I am doing wrong here. The number of sentences keeps coming out as 0, even though I am trying to count the sentences (full stops) with text.count('.').
Is there anything in my code which would make this print out "0"?
Thanks
def countSentences(fileName):
    """This is a function to count the number
    of sentences in a given text file"""
    f = open(fileName, 'r')
    text = f.read()
    text = text.split()
    print("Total sentences : " + str(text.count('.')))
    f.close()
in Main() I have
print(countSentences('phrases.txt'))
which passes in a file with numerous sentences.
Upvotes: 3
Views: 5390
Reputation: 93
spaCy will take care of your problem.
import spacy
nlp = spacy.load('en_core_web_sm')
with open('fileNamepath') as f:
    doc = nlp(f.read())
sentence_tokens = [[token.text for token in sent] for sent in doc.sents]
print(len(sentence_tokens))
sentence_tokens is a list with one entry per sentence in the file, each entry being the list of token strings for that sentence, built by iterating over doc.sents. You can read more about it here.
Upvotes: 2
Reputation: 10129
Ok, let's see. Correcting your code so that it counts '.' is easy. It goes like this:
# Open in text mode so read() returns a str rather than bytes
with open('example_file.txt', 'r') as f:
    text = f.read()
num_sentences = text.count('.')
print("Number of sentences found: {}".format(num_sentences))
However, as Joshua pointed out, counting '.' is not enough. There are many cases in which a dot appears without marking a sentence boundary, for example in abbreviations or even emojis. To count sentences, you need a natural language processing library designed for that, or at least a more sophisticated approach.
Think of a file called example_file.txt with the following inside:
Hello this is an example file. I am pleased that you found me. The hour now is 2:00 p.m. Hope you have a great day.
Counting '.' would answer 5, because both dots in "p.m." are counted, but the correct answer is 4.
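As a rough illustration of what a "more sophisticated approach" without an NLP library might look like, here is a minimal sketch using only the standard library. The helper name count_sentences_heuristic and the rule it applies (a run of ., ! or ? counts only when followed by whitespace and a capital letter, or by the end of the text) are assumptions made for this example, not an established algorithm. It happens to handle the example above, but it will still miscount text such as "Dr. Smith" or "at 2 p.m. Tuesday".

```python
import re

# Hypothetical heuristic: a run of '.', '!' or '?' ends a sentence only if it
# is followed by whitespace and a capital letter, or by the end of the text.
SENTENCE_END = re.compile(r'[.!?]+(?=\s+[A-Z]|\s*$)')

def count_sentences_heuristic(text):
    """Return the number of sentence-ending punctuation runs found."""
    return len(SENTENCE_END.findall(text))

example = ("Hello this is an example file. I am pleased that you found me. "
           "The hour now is 2:00 p.m. Hope you have a great day.")

dot_count = example.count('.')                    # counts every '.' character
heuristic_count = count_sentences_heuristic(example)
print(dot_count)        # 5
print(heuristic_count)  # 4
```

A dedicated library remains the safer choice for real text; this only shows why the raw dot count overshoots.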
The following code shows the error and how it can be done correctly using spaCy.
import spacy

with open('example_file.txt', 'r') as f:
    text = f.read()
# Naive approach: every '.' is counted, including those in "p.m."
num_sentences = text.count('.')
print("Number of sentences found: {}".format(num_sentences))

# Recent spaCy versions no longer accept the 'en' shortcut; load the model by its full name
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)
print("Actual number of sentences: {}".format(len(list(doc.sents))))
Hope it helps :)
Upvotes: 2
Reputation: 6621
It would appear from your code that the variable text is a list of strings (the result of text.split()), so text.count('.') only counts elements that are exactly '.', and there are none.
Counting sentences is a pretty tricky thing, since the '.' can show up in a lot of places that do not terminate a sentence. I would recommend something like nltk or spacy to accomplish this task more effectively.
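To make the difference concrete, here is a small sketch comparing the two count calls. The sample string is made up for illustration; the point is only that list.count looks for whole elements equal to '.', while str.count looks for the character anywhere in the string.

```python
sample = "Hello there. How are you. Fine."

# After split(), the periods stay attached to words: 'there.', 'you.', 'Fine.'
tokens = sample.split()
list_count = tokens.count('.')   # counts elements that are exactly '.'

# On the original string, count() finds every '.' character
str_count = sample.count('.')

print(tokens)
print(list_count)  # 0
print(str_count)   # 3
```

Dropping the text = text.split() line from the question's function is therefore enough to get a non-zero dot count, although that is still not a true sentence count.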
Upvotes: 2