Printing out the number of sentences in a text file

I really cannot see what I am doing wrong here. The number of sentences is always reported as 0, even though I am trying to count the sentences (full stops) with text.count('.').

Is there anything in my code that would make this print "0"?

Thanks

def countSentences(fileName) :
    """This is a function to count the number
    of sentences in a given text file"""
    f = open(fileName, 'r')
    text = f.read()
    text = text.split()
    print("Total sentences : " + str(text.count('.')))

    f.close()

In Main() I have:

print(countSentences('phrases.txt'))

which passes in a file with numerous sentences.

Upvotes: 3

Views: 5390

Answers (3)

user304663

Reputation: 93

spaCy will take care of your problem.

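import spacy

# Load the small English pipeline (it must be installed separately).
nlp = spacy.load('en_core_web_sm')

with open('fileNamepath') as f:
    doc = nlp(f.read())

# One list of token strings per sentence detected by spaCy.
sentence_tokens = [[token.text for token in sent] for sent in doc.sents]
print(len(sentence_tokens))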

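sentence_tokens is a list with one entry per sentence in the file, where each entry is the list of token strings in that sentence. It is built by iterating over doc.sents, so its length is the number of sentences. You can read more about it in the spaCy documentation.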

Upvotes: 2

gdaras

Reputation: 10129

OK, let's see. Correcting your code so that it counts '.' is easy. It would look like this:

with open('example_file.txt', 'r') as f:
    # Read the file as text and count on the string itself (no split()).
    text = f.read()
    num_sentences = text.count('.')
    print("Number of sentences found: {}".format(num_sentences))

However, as Joshua pointed out, counting '.' is not enough. A dot often appears where it is not a sentence boundary, for example in abbreviations or even emojis. To count sentences, you need a natural language processing library designed for that, or at least a more sophisticated approach.

Consider a file called example_file.txt with the following contents:

Hello this is an example file. I am pleased that you found me. The hour now is 2:00 p.m. Hope you have a great day.

Counting dots would answer 5 (the two dots in "p.m." are each counted), but the correct answer is 4.
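To make the difference concrete without installing anything, here is a rough sketch using only the standard library. The regular expression is my own simple heuristic, not what spaCy or nltk do internally: it assumes a sentence ends at a run of '.', '!' or '?' that is followed by whitespace and a capital letter, or by the end of the text. It will still miscount cases like "Dr. Smith", so treat it as an illustration only.

```python
import re

text = ("Hello this is an example file. I am pleased that you found me. "
        "The hour now is 2:00 p.m. Hope you have a great day.")

# Naive approach: every '.' is treated as a sentence end.
dot_count = text.count('.')

# Heuristic (an assumption, not a real tokenizer): punctuation ends a
# sentence only if it is followed by whitespace and a capital letter,
# or by the end of the text.
BOUNDARY = re.compile(r'[.!?]+(?=\s+[A-Z]|\s*$)')
heuristic_count = len(BOUNDARY.findall(text))

print(dot_count)        # 5
print(heuristic_count)  # 4
```

The dot inside "p.m." is skipped because it is followed directly by "m", while the final dot of "p.m." happens to coincide with a real sentence end here.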

The following code shows the error and how it can be done correctly using spaCy.

import spacy

# spaCy 3 and later no longer accept the 'en' shortcut; load the pipeline by name.
nlp = spacy.load('en_core_web_sm')

with open('example_file.txt', 'r') as f:
    text = f.read()

# Naive approach: every '.' counts as a sentence end.
num_sentences = text.count('.')
print("Number of sentences found: {}".format(num_sentences))

# spaCy segments the text into sentences.
doc = nlp(text)
print("Actual number of sentences: {}".format(len(list(doc.sents))))

Hope it helps :)

Upvotes: 2

Joshua Smith

Reputation: 6621

It would appear from your code that the variable text is a list of strings (the result of text.split()), so text.count('.') looks for list elements that are exactly '.' and finds none.
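A small sketch of the difference, assuming a made-up sample string. count_sentences is just an illustrative name for a corrected version of your function. Note that it returns the count instead of printing it: your original function has no return statement, so print(countSentences(...)) would also print None.

```python
import os
import tempfile


def count_sentences(file_name):
    """Return a rough sentence count: the number of '.' characters in the file."""
    with open(file_name, 'r') as f:
        text = f.read()
    # Count on the string itself; do not split() first.
    return text.count('.')


sample = "One. Two. Three."
words = sample.split()         # ['One.', 'Two.', 'Three.']
list_count = words.count('.')  # list.count matches whole elements only
str_count = sample.count('.')  # str.count matches substrings

# Write the sample to a temporary file so the function can be tried out.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write(sample)
file_count = count_sentences(tmp.name)
os.remove(tmp.name)

print(list_count, str_count, file_count)  # 0 3 3
```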

Counting sentences is a pretty tricky thing, since a . can show up in many places where it does not end a sentence. I would recommend something like nltk or spaCy to accomplish this task more effectively.

Upvotes: 2
