Van Peer

Reputation: 2167

python - find matching sentences in file

I have a text file containing about 35k words in paragraphs. A sample is below:

This sentence does repeat? This sentence does not repeat! This sentence does not repeat. This sentence does repeat.
This sentence does repeat. This sentence does not repeat! This sentence does not repeat. This sentence does repeat!

I want to identify matching sentences. One way I found is to split the paragraphs into separate lines, using ., !, ? etc. as the delimiters, and then look for matching lines.

Code

import collections as col

with open('txt.txt', 'r') as f:
    l = f.read().replace('. ','.\n').replace('? ','?\n').replace('! ','!\n').splitlines()
print([i for i, n in col.Counter(l).items() if n > 1])

Please suggest some better approaches.

Upvotes: 0

Views: 809

Answers (3)

Chen A.

Reputation: 11338

You can do it a different way. The re module is very powerful:

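import re
from collections import Counter

# Split after '?', '.' or '!' followed by whitespace, keeping the punctuation.
pat = r'(?<=[.?!])\s+'
c = Counter()
with open('filename') as f:
    for line in f:
        # Count every non-empty sentence on this line.
        c.update(s for s in re.split(pat, line.strip()) if s)

print([s for s, n in c.items() if n > 1])

This builds a regex pattern that splits after ?, . or ! when whitespace follows, so the punctuation stays attached to its sentence. The for loop processes the file line by line, and the Counter tallies each sentence; anything with a count above 1 is a repeat.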

Upvotes: 0

rb612

Reputation: 5583

You can use split:

import re
...
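# Use '+' rather than '*': a pattern that can match the empty string makes re.split cut between every character on Python 3.7+.
l = re.split(r'[?!.]+\s*', f.read())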
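
As a fuller sketch of the same idea, the pieces can be stripped and counted. The sample text from the question is inlined here instead of being read from a file, and `sentences` and `repeated` are illustrative names, not part of the answer above. Note that this variant discards the punctuation, so "This sentence does repeat?" and "This sentence does repeat." are treated as the same sentence.

```python
import re
from collections import Counter

# Sample text from the question, inlined so the sketch is self-contained.
text = (
    "This sentence does repeat? This sentence does not repeat! "
    "This sentence does not repeat. This sentence does repeat.\n"
    "This sentence does repeat. This sentence does not repeat! "
    "This sentence does not repeat. This sentence does repeat!"
)

# Split on runs of sentence-ending punctuation, then drop surrounding
# whitespace and any empty pieces.
sentences = [s.strip() for s in re.split(r'[?!.]+', text) if s.strip()]

# Keep only the sentences seen more than once.
repeated = {s: n for s, n in Counter(sentences).items() if n > 1}
print(repeated)
```

If the punctuation should distinguish sentences, it has to be kept when splitting, for example by splitting on the whitespace that follows it rather than on the punctuation itself.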

Upvotes: 4

Stuart Buckingham

Reputation: 1784

I cannot guarantee it would be the fastest, but I would try to exploit the speed of sort. First, split the text by punctuation to get a list of sentences. Then sort the list so that identical sentences end up next to each other. Finally, loop through the sorted list, count each run of consecutive identical sentences, and store the sentence and its count in a dict.
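
The steps described above could look roughly like the following. This is a minimal sketch, not a benchmarked solution: `count_repeats` and `sample` are hypothetical names, the text is inlined instead of read from a file, and the split pattern keeps the punctuation with each sentence, which is an assumption about what should count as a match.

```python
import re
from itertools import groupby


def count_repeats(text):
    """Return {sentence: count} for sentences that occur more than once."""
    # Split after '.', '?' or '!' followed by whitespace, keeping the punctuation.
    sentences = [s for s in re.split(r'(?<=[.?!])\s+', text.strip()) if s]
    # Sorting places identical sentences next to each other.
    sentences.sort()
    counts = {}
    # groupby yields one group per run of consecutive equal items.
    for sentence, run in groupby(sentences):
        n = sum(1 for _ in run)
        if n > 1:
            counts[sentence] = n
    return counts


# Sample text from the question.
sample = (
    "This sentence does repeat? This sentence does not repeat! "
    "This sentence does not repeat. This sentence does repeat.\n"
    "This sentence does repeat. This sentence does not repeat! "
    "This sentence does not repeat. This sentence does repeat!"
)
print(count_repeats(sample))
```

A collections.Counter produces the same tally in a single pass without sorting, so the sort mainly pays off when the sentences are wanted in sorted order anyway; which one is faster on a 35k-word file would have to be measured.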

Upvotes: 0
