Cristian Tamblay

Reputation: 65

Extracting sentences including a word from large corpus, including the punctuation, in python

I am working with a large corpus (~30GB) and need to extract every sentence that contains any word from a list (~5000 words), keeping the sentence's punctuation. I'm using a regex approach, but I'm open to any suggestions regarding the efficiency of the method. The following code, obtained from here, extracts the sentences that include 'anarchism', but drops the final punctuation.

import re

with open(f_path, 'r') as f_in:  # f_path: path to the corpus file
    for line in f_in:
        sentences = re.findall(r'([^.!?]*anarchism[^.!?]*)', line)

Input:

anarchism, is good. anarchism? anarchism!

Actual return:

['anarchism, is good', ' anarchism', ' anarchism']

Expected return:

['anarchism, is good.', 'anarchism?', 'anarchism!']

Any suggestions?

Upvotes: 1

Views: 628

Answers (2)

l_l_l_l_l_l_l_l

Reputation: 538

Your pattern will split sentences in places you probably don't want, for example after "Mr." in "Mr. Tamblay", because it treats every period as a sentence boundary. You can use a sentence tokenizer from nltk for a more sophisticated split. To check whether any of your words is in a sentence, you can then test each keyword against each sentence.

import nltk
sentence_tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()
...
for line in f_in:
    for start, end in sentence_tokenizer.span_tokenize(line):
        sentence = line[start:end]
        for keyword in keywords:
            if keyword in sentence:
                do_something()

If iterating over all the keywords for every sentence is too slow, you can explore searching each sentence for all keywords at once with the Aho-Corasick algorithm.
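
To give an idea of what searching for all keywords at once looks like, here is a minimal pure-Python sketch of an Aho-Corasick automaton. The function names build_automaton and find_keywords are made up for illustration, and like the `keyword in sentence` check above it matches substrings, not whole words. For a 30GB corpus a compiled implementation would likely be a better fit; this is only meant to show the idea.

```python
from collections import deque

def build_automaton(keywords):
    """Build a minimal Aho-Corasick automaton as three parallel lists."""
    goto = [{}]      # goto[state][char] -> next state
    fail = [0]       # fail[state] -> state to fall back to on a mismatch
    out = [set()]    # out[state] -> keywords that end at this state

    # Phase 1: insert every keyword into a trie.
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(word)

    # Phase 2: breadth-first pass to compute the failure links.
    # Depth-1 states keep the default failure link to the root.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit keywords that end at the fallback state.
            out[nxt] |= out[fail[nxt]]
    return goto, fail, out

def find_keywords(text, automaton):
    """Return the set of keywords occurring in text, in one pass over it."""
    goto, fail, out = automaton
    state, found = 0, set()
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        found |= out[state]
    return found
```

You would build the automaton once from your keyword list and then replace the inner `for keyword in keywords` loop with `if find_keywords(sentence, automaton): do_something()`.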

Upvotes: 1

finefoot

Reputation: 11372

With [^.!?]* at the end of your pattern, you're explicitly excluding any punctuation. If you're certain that your sentence ends in exactly one of [.!?], you could just add that to the pattern:

>>> import re
>>> line = "anarchism, is good. anarchism? anarchism!"
>>> re.findall(r'([^.!?]*anarchism[^.!?]*[.!?])', line)
['anarchism, is good.', ' anarchism?', ' anarchism!']
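
Note that the matches above still carry the leading space. As a rough sketch of how this could be extended to a keyword list, assuming none of the keywords contain '.', '!' or '?', you could build a single alternation and strip each match. The helper names build_sentence_pattern and extract_sentences are just placeholders, and I haven't measured how this scales to ~5000 keywords.

```python
import re

def build_sentence_pattern(keywords):
    # Escape each keyword so regex metacharacters are matched literally.
    alternation = "|".join(re.escape(k) for k in keywords)
    # A "sentence" here is a run of non-terminators that contains a keyword,
    # followed by exactly one of . ! ?
    return re.compile(r"[^.!?]*(?:" + alternation + r")[^.!?]*[.!?]")

def extract_sentences(line, pattern):
    # The pattern has no capturing group, so findall returns whole matches;
    # strip() removes the whitespace left over from the previous sentence.
    return [match.strip() for match in pattern.findall(line)]

pattern = build_sentence_pattern(["anarchism"])
print(extract_sentences("anarchism, is good. anarchism? anarchism!", pattern))
# ['anarchism, is good.', 'anarchism?', 'anarchism!']
```

This still matches keywords as substrings (so 'anarchism' also hits 'anarchisms'), and it inherits the "Mr." problem mentioned in the other answer.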

Upvotes: 1
