user1162069

Reputation:

How to parse a file sentence by sentence in Python

I need to read a large number of large text files.

For each file, I need to open it and read the text in sentence by sentence.

Most of the approaches I found read the file line by line.

How can I do it with Python?

Upvotes: 2

Views: 3753

Answers (2)

matisetorm

Reputation: 853

If you want sentence tokenization, NLTK is probably the quickest way to get it. The Punkt tokenizer (http://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.punkt) will get you pretty far.

For example, this code from the docs:

>>> import nltk.data
>>> text = '''
... Punkt knows that the periods in Mr. Smith and Johann S. Bach
... do not mark sentence boundaries.  And sometimes sentences
... can start with non-capitalized words.  i is a good variable
... name.
... '''
>>> sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
>>> print('\n-----\n'.join(sent_detector.tokenize(text.strip())))


Punkt knows that the periods in Mr. Smith and Johann S. Bach
do not mark sentence boundaries.
-----
And sometimes sentences
can start with non-capitalized words.
-----
i is a good variable
name.
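
To apply this to a file rather than a string literal, a minimal sketch might look like the following. The helper name `sentences_in_file` and its `tokenize` parameter are hypothetical, not part of NLTK. By default it loads the same Punkt model as in the example above, which assumes NLTK and that Punkt data file are installed. It reads the whole file into memory, so it only suits files that fit in RAM.

```python
def sentences_in_file(path, tokenize=None, encoding="utf-8"):
    """Return the sentences in the file at `path` as a list of strings.

    `tokenize` is any callable that maps a string to a list of sentences.
    If it is omitted, the Punkt English model from the example above is
    loaded (assumption: NLTK and its Punkt data are installed).
    """
    if tokenize is None:
        # Imported lazily so that a custom tokenizer does not require NLTK.
        import nltk.data
        tokenize = nltk.data.load('tokenizers/punkt/english.pickle').tokenize
    with open(path, encoding=encoding) as handle:
        text = handle.read()
    return tokenize(text.strip())
```

Calling `sentences_in_file("sample.txt")` would then return the list of sentences, and you can loop over the file names to process many files in turn.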

Upvotes: 4

Alen

Reputation: 16

If the files have a large number of lines, you could write a generator using the yield statement:

def read(filename):
    # Iterate over the file lazily; the with block closes it when done
    with open(filename, "r") as file:
        for line in file:
            for word in line.split():
                yield word

for word in read("sample.txt"):
    print(word)

This yields every word on each line of the file, one at a time.
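
The same generator idea can be adapted to yield sentences instead of words. What follows is a rough sketch, not a robust tokenizer: the name `read_sentences` and the boundary rule (a `.`, `!` or `?` followed by whitespace) are assumptions made for illustration, and unlike Punkt this rule will wrongly split after abbreviations such as "Mr.". It buffers text across lines, so a sentence that spans several lines is returned whole.

```python
import re

# Naive boundary: '.', '!' or '?' followed by whitespace.
# Assumption: a deliberate simplification; it splits after abbreviations too.
_BOUNDARY = re.compile(r'(?<=[.!?])\s+')

def read_sentences(filename, encoding="utf-8"):
    """Yield sentences one at a time without loading the whole file."""
    buffer = ""
    with open(filename, encoding=encoding) as handle:
        for line in handle:
            buffer += line
            parts = _BOUNDARY.split(buffer)
            # Every piece except the last is a complete sentence;
            # the last piece may continue on the next line.
            for sentence in parts[:-1]:
                sentence = " ".join(sentence.split())  # collapse newlines
                if sentence:
                    yield sentence
            buffer = parts[-1]
    # Whatever remains after the final line is the last sentence.
    tail = " ".join(buffer.split())
    if tail:
        yield tail
```

It is used the same way as the word generator: `for sentence in read_sentences("sample.txt"): print(sentence)`. Only the current unfinished sentence is held in memory, which matters for large files.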

Upvotes: -2
