Fidycent

Reputation: 21

Python Comparing text files for similar or equal lines

I have 2 text files. My goal is to find the lines in First.txt that are not in Second.txt and write those lines to a third text file, Missing.txt. I have that part working:

fn = "Missing.txt"
with open('Second.txt', 'r', encoding='utf-8', errors='ignore') as fileSecondary:
    bLines = {thing.strip() for thing in fileSecondary}
# mode 'w' already truncates the file, so no try/except or truncate() is needed
with open('First.txt', 'r', encoding='utf-8', errors='ignore') as filePrimary, \
        open(fn, 'w', encoding='utf-8') as fileOutPut:
    for line in filePrimary:
        line = line.strip()
        if line not in bLines:
            fileOutPut.write(line + '\n')

But after running the script I ran into a problem: some lines are very similar but not identical. Examples:

[PR] Zero One Two Three ft Four

and (No space after the bracket)

[PR]Zero One Two Three ft Four

or

[PR] Zero One Two Three ft Four

and (capital F letter)

[PR] Zero One Two Three Ft Four

I have found SequenceMatcher, which does what I need, but how do I use it in this comparison, since I am comparing a string against a set rather than two strings?

Upvotes: 2

Views: 1565

Answers (1)

pault

Reputation: 43504

IIUC, you want to match lines even if the white space or capitalization is different.

One easy way to do this is to remove white space and just make everything the same case on the read:

import re

def format_line(line):
    # Drop all whitespace and lowercase, so "[PR] Zero" and "[PR]zero" compare equal
    return re.sub(r"\s+", "", line).lower()

with open('Second.txt', 'r', encoding='utf-8', errors='ignore') as fileSecondary:
    bLines = {format_line(thing) for thing in fileSecondary}

with open('First.txt', 'r', encoding='utf-8', errors='ignore') as filePrimary, \
        open('Missing.txt', 'w', encoding='utf-8') as fileOutPut:
    for line in filePrimary:
        if format_line(line) not in bLines:
            # line still ends with its newline from the file; write exactly one
            fileOutPut.write(line.rstrip('\n') + '\n')
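As a quick sanity check, here is a small sketch that runs the same normalization over the example lines from the question. The samples list and the keys name are only for illustration; the point is that all three variants collapse to one key, which is why the set lookup then treats them as equal.

```python
import re

def format_line(line):
    # Same normalization as above: remove all whitespace, then lowercase
    return re.sub(r"\s+", "", line).lower()

# Example lines taken from the question
samples = [
    "[PR] Zero One Two Three ft Four",
    "[PR]Zero One Two Three ft Four",
    "[PR] Zero One Two Three Ft Four",
]

keys = {format_line(s) for s in samples}
print(keys)  # {'[pr]zeroonetwothreeftfour'}
```

Note that this only handles whitespace and case differences. Lines that differ by a typo or a missing word still count as different.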

Update 1: Fuzzy matching

If you want fuzzy matching, you could use something like nltk.metrics.distance.edit_distance (docs), but in the worst case you can't avoid comparing every line in First.txt against every line in Second.txt. You lose the speed of the in operation, which is a constant-time lookup on a set.

For example

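from nltk.metrics.distance import edit_distance as dist

threshold = 3  # the maximum number of edits between lines

# format_line and bLines are defined as in the previous snippet
with open('First.txt', 'r', encoding='utf-8', errors='ignore') as filePrimary, \
        open('Missing.txt', 'w', encoding='utf-8') as fileOutPut:
    for line in filePrimary:
        fline = format_line(line)
        # a generator lets any() stop at the first close-enough line
        match_found = any(dist(fline, other_line) <= threshold for other_line in bLines)

        if not match_found:
            fileOutPut.write(line.rstrip('\n') + '\n')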
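Since the question mentions SequenceMatcher, the same idea can be sketched with difflib from the standard library instead of nltk. This is only a rough sketch: is_similar, missing_lines and the 0.9 cutoff are names and values I made up for illustration, and the cutoff would need tuning on your real data. It works on lists of lines so it is easy to try out. As with edit distance, every unmatched line is still compared against every line of the other file.

```python
from difflib import SequenceMatcher

def is_similar(line, candidates, cutoff=0.9):
    # True if line is at least `cutoff` similar (ratio in 0..1) to any candidate
    return any(
        SequenceMatcher(None, line, other).ratio() >= cutoff
        for other in candidates
    )

def missing_lines(primary, secondary, cutoff=0.9):
    # Lines from primary with no exact or close match in secondary
    known = {s.strip() for s in secondary}
    result = []
    for raw in primary:
        line = raw.strip()
        # cheap exact set lookup first, slow fuzzy scan only if that fails
        if line in known or is_similar(line, known, cutoff):
            continue
        result.append(line)
    return result

primary = [
    "[PR]Zero One Two Three ft Four\n",
    "[PR] Zero One Two Three Ft Four\n",
    "Completely different title\n",
]
secondary = ["[PR] Zero One Two Three ft Four\n"]
print(missing_lines(primary, secondary))  # ['Completely different title']
```

To use it with the files, pass the open file objects (or their lines) as primary and secondary and write the returned lines to Missing.txt.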

Upvotes: 2
