Reputation: 2167
I have a text file which contains 35k words in paragraphs. A sample is below:
This sentence does repeat? This sentence does not repeat! This sentence does not repeat. This sentence does repeat.
This sentence does repeat. This sentence does not repeat! This sentence does not repeat. This sentence does repeat!
I wanted to identify matching sentences. One way I managed to find is to split the paragraphs into separate lines using ., !, ?, etc. as the delimiters and look for matching lines.
Code
import collections as col
with open('txt.txt', 'r') as f:
    l = f.read().replace('. ','.\n').replace('? ','?\n').replace('! ','!\n').splitlines()
print([i for i, n in col.Counter(l).items() if n > 1])
Please suggest some better approaches.
Upvotes: 0
Views: 809
Reputation: 11338
You can do it a different way. The re module is very powerful:
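import re
from collections import Counter

# Split on whitespace that follows '?', '.' or '!', keeping the punctuation.
pat = r'(?<=[.?!])\s+'
c = Counter()
with open('filename') as f:
    for line in f:
        # Count each sentence on the line, skipping empty pieces.
        c.update(s for s in re.split(pat, line.strip()) if s)
# Sentences seen more than once.
print([s for s, n in c.items() if n > 1])
This creates a regex pattern matching the whitespace after ?, . or ! and splits each line into sentences there, so the Counter tallies sentences rather than whole lines.
Using the for loop, this happens on a line basis.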
Upvotes: 0
Reputation: 1784
I cannot guarantee it would be the fastest, but I would try to exploit the speed of sort. First I would split the text by punctuation to give a list of sentences, then sort the list so that identical sentences end up next to each other, and finally loop through the list, counting each run of consecutive identical sentences and storing the sentence and its count in a dict.
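A minimal sketch of that idea, in case it helps. The function name repeated_sentences and the splitting regex are my own assumptions (the regex splits on whitespace after ., ? or ! and keeps the punctuation, as the question's code does), and itertools.groupby stands in for the manual loop over consecutive items. I have not benchmarked it against the Counter version.

```python
import re
from itertools import groupby

def repeated_sentences(text):
    """Return {sentence: count} for sentences that occur more than once."""
    # Assumed splitting rule: break on whitespace that follows '.', '?' or '!'.
    sentences = [s for s in re.split(r'(?<=[.?!])\s+', text.strip()) if s]
    # Sorting places identical sentences next to each other.
    sentences.sort()
    counts = {}
    # groupby yields one group per run of consecutive equal items.
    for sentence, run in groupby(sentences):
        n = sum(1 for _ in run)
        if n > 1:
            counts[sentence] = n
    return counts

sample = (
    "This sentence does repeat? This sentence does not repeat! "
    "This sentence does not repeat. This sentence does repeat.\n"
    "This sentence does repeat. This sentence does not repeat! "
    "This sentence does not repeat. This sentence does repeat!"
)
print(repeated_sentences(sample))
```

Sentences that differ only in their final punctuation are treated as different here; strip the punctuation before sorting if they should count as the same.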
Upvotes: 0