Reputation: 21
I have two text files. My goal is to find the lines in First.txt that are not in Second.txt and write those lines to a third text file, Missing.txt. I have that working:
fn = "Missing.txt"

# Opening with 'w' already creates or truncates the file,
# so no retry in an except block and no truncate() call is needed.
with open('First.txt', 'r', encoding='utf-8', errors='ignore') as filePrimary, \
     open('Second.txt', 'r', encoding='utf-8', errors='ignore') as fileSecondary, \
     open(fn, 'w', encoding='utf-8') as fileOutPut:
    # Stripped lines of Second.txt, kept in a set for fast membership tests
    bLines = set(thing.strip() for thing in fileSecondary)
    for line in filePrimary:
        line = line.strip()
        if line not in bLines:
            fileOutPut.write(line + '\n')
But after running the script I ran into a problem: some lines are very similar without being identical. Examples:
[PR] Zero One Two Three ft Four
and (No space after the bracket)
[PR]Zero One Two Three ft Four
or
[PR] Zero One Two Three ft Four
and (capital F letter)
[PR] Zero One Two Three Ft Four
I have found SequenceMatcher, which does what I need, but how do I use it in this comparison, since I am not comparing two strings but a string against a set?
Upvotes: 2
Views: 1565
Reputation: 43504
IIUC, you want to match lines even if the white space or capitalization is different.
One easy way to do this is to remove all white space and convert everything to the same case as you read the lines:
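import re

def format_line(line):
    # Remove all whitespace and lowercase, so lines that differ only
    # in spacing or capitalization compare equal
    return re.sub(r"\s+", "", line).lower()

filePrimary = open('First.txt', 'r', encoding='utf-8', errors='ignore')
fileSecondary = open('Second.txt', 'r', encoding='utf-8', errors='ignore')
# fileOutPut is opened for writing as in the question
bLines = set(format_line(thing) for thing in fileSecondary)
for line in filePrimary:
    fline = format_line(line)
    if fline not in bLines:
        # Write the original line, not the normalized one;
        # rstrip avoids doubling the newline the line already carries
        fileOutPut.write(line.rstrip('\n') + '\n')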
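As a quick sanity check, here is a minimal sketch that applies the same normalization to the example lines from the question. The helper is redefined so the snippet runs on its own, and the samples list is only there for illustration:

```python
import re

def format_line(line):
    # Remove all whitespace and lowercase the rest
    return re.sub(r"\s+", "", line).lower()

# The near-duplicate lines quoted in the question
samples = [
    "[PR] Zero One Two Three ft Four",
    "[PR]Zero One Two Three ft Four",
    "[PR] Zero One Two Three Ft Four",
]

# All three variants should collapse to one normalized key
keys = {format_line(s) for s in samples}
print(keys)  # {'[pr]zeroonetwothreeftfour'}
```

Because all three variants map to the same key, the set lookup treats them as the same line.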
Update 1: Fuzzy matching
If you want fuzzy matching, you could use something like nltk.metrics.distance.edit_distance, but you can't get around comparing every line to every other line (worst case), so you lose the speed of the set's in operation.
For example:
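from nltk.metrics.distance import edit_distance as dist

threshold = 3  # lines fewer than this many edits apart count as a match

# format_line, bLines, filePrimary and fileOutPut as defined above
for line in filePrimary:
    fline = format_line(line)
    # Generator instead of a list, so any() can stop at the first match
    match_found = any(dist(fline, other_line) < threshold for other_line in bLines)
    if not match_found:
        fileOutPut.write(line.rstrip('\n') + '\n')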
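Since the question mentions SequenceMatcher, the same loop can probably be written with the standard library's difflib instead of nltk. This is only a sketch: is_similar and the 0.9 cutoff are names and values I made up for illustration, and the cutoff would need tuning against your data.

```python
from difflib import SequenceMatcher

def is_similar(line, candidates, cutoff=0.9):
    """Return True if `line` is at least `cutoff` similar to any candidate.

    `cutoff` is a hypothetical threshold on SequenceMatcher.ratio(),
    which ranges from 0.0 (nothing in common) to 1.0 (identical).
    """
    return any(
        SequenceMatcher(None, line, other).ratio() >= cutoff
        for other in candidates
    )

# Already-normalized lines, as format_line would produce them
bLines = {"[pr]zeroonetwothreeftfour"}

close = is_similar("[pr]zeroonetwothreefeatfour", bLines)  # "feat" vs "ft"
far = is_similar("somethingelseentirely", bLines)
print(close, far)  # True False
```

In the loop above you would replace the edit_distance test with `if not is_similar(fline, bLines):`. This still compares each line against every line in the set in the worst case, so it is no faster than the edit distance version. difflib.get_close_matches is another option if you also want to see which lines matched.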
Upvotes: 2