Hyperion

Reputation: 2625

Python - Compare values in lists (not 1:1 match)

I've got 2 txt files that are structured like this:

File 1

LINK1;FILENAME1
LINK2;FILENAME2
LINK3;FILENAME3

File 2

FILENAME1
FILENAME2
FILENAME3

And I use this code to print the "unique" lines contained in both files:

with open('1.txt', 'r') as f1, open('2.txt', 'r') as f2:
    a = f1.readlines()
    b = f2.readlines()

non_duplicates = [line for line in a if line not in b]
non_duplicates += [line for line in b if line not in a]

for line in non_duplicates:
    print(line)

The problem is that this way it prints all the lines of both files. What I want instead is to check whether FILENAME1 appears in some line of file 1 (the one with both links and filenames) and, if so, delete that line.

Upvotes: 1

Views: 146

Answers (3)

dwardu

Reputation: 2300

Unless the files are too large, you may print the lines of file1.txt (which I'll call entries) whose filename part is not listed in file2.txt with something like this:

with open('file1.txt') as f1:
    entries = f1.read().splitlines()

with open('file2.txt') as f2:
    filenames_to_delete = f2.read().splitlines()

print([entry for entry in entries if entry.split(';')[1] not in filenames_to_delete])

If file1.txt is large and file2.txt is small, then you may load the filenames in file2.txt entirely in memory, and then open file1.txt and go through it, checking against the in-memory list.

If file1.txt is small and file2.txt is large, you may do it the other way round.
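The first case (small file2.txt, large file1.txt) can be sketched roughly as below; the helper name is made up, and the demo uses in-memory lines in place of a real file object:

```python
def keep_unlisted(entries, blacklist):
    """Yield entries ('LINK;FILENAME' lines) whose filename part
    is not in `blacklist` (a set of filenames)."""
    for entry in entries:
        if entry.rstrip("\n").split(";")[1] not in blacklist:
            yield entry

# The same generator works unchanged on a file object from open("file1.txt"),
# streaming it line by line instead of loading it whole.
demo_entries = ["LINK1;FILENAME1\n", "LINK2;FILENAME4\n"]
print(list(keep_unlisted(demo_entries, {"FILENAME1", "FILENAME2"})))
# ['LINK2;FILENAME4\n']
```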

If file1.txt and file2.txt are both excessively large, then if it is known that both files’ lines are sorted by filename, one could write some elaborate code to take advantage of that sorting to get the task done without loading the entire files in memory, as in this SO question. But if this is not an issue, you’ll be better off loading everything in memory and keeping things simple.
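A rough sketch of that merge idea, assuming both inputs really are sorted by filename; the helper name is invented for illustration, and the demo uses in-memory lines rather than real files:

```python
def entries_not_in_blacklist(entries, filenames):
    """Yield lines of `entries` ('LINK;FILENAME') whose filename does not
    appear in `filenames`. Both inputs must be sorted by filename."""
    filenames = iter(filenames)
    current = next(filenames, None)
    for entry in entries:
        name = entry.rstrip("\n").split(";")[1]
        # Advance the blacklist until it catches up with this filename,
        # so each input is consumed once, front to back.
        while current is not None and current < name:
            current = next(filenames, None)
        if current != name:
            yield entry

# Tiny demo with in-memory data instead of real files:
entries = ["LINK1;FILENAME1\n", "LINK2;FILENAME4\n", "LINK3;FILENAME5\n"]
blacklist = ["FILENAME1", "FILENAME2", "FILENAME3"]
print(list(entries_not_in_blacklist(entries, blacklist)))
# ['LINK2;FILENAME4\n', 'LINK3;FILENAME5\n']
```

With real files you would pass the two open file objects (stripping newlines from the blacklist side), keeping memory usage constant regardless of file size.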

P.S. Since it is not necessary to open the two files simultaneously, we avoid doing so: we open a file, read it, close it, and then repeat for the next. That keeps the code simpler to follow.

Upvotes: 0

Padraic Cunningham

Reputation: 180522

If file2 is not too big, create a set of all its lines, then split each line of file1 and check whether the second element is in that set:

import fileinput
import sys
with open("file2.txt") as f:
    lines = set(map(str.rstrip,f)) # itertools.imap python2
    for line in fileinput.input("file1.txt",inplace=True): 
        # if FILENAME1 etc.. is not in the line write the line
        if line.rstrip().split(";")[1] not in lines:
            sys.stdout.write(line)

file1:

LINK1;FILENAME1
LINK2;FILENAME2
LINK3;FILENAME3
LINK1;FILENAME4
LINK2;FILENAME5
LINK3;FILENAME6

file2:

FILENAME1
FILENAME2
FILENAME3

file1 after:

LINK1;FILENAME4
LINK2;FILENAME5
LINK3;FILENAME6

fileinput.input with inplace=True rewrites the original file in place, so you don't need to store the lines in a list.

You can also write the unique lines to a tempfile and use shutil.move to replace the original file:

from tempfile import NamedTemporaryFile
from shutil import move

with open("file2.txt") as f, open("file1.txt") as f2, \
        NamedTemporaryFile("w", dir=".", delete=False) as out:
    lines = set(map(str.rstrip, f))
    for line in f2:
        if line.rstrip().split(";")[1] not in lines:
            out.write(line)

move(out.name,"file1.txt")

If your code errors out, using a tempfile means you won't lose any data in the original file.

Using a set to store the lines means we have on average O(1) lookups; storing all the lines in a list instead would give you a quadratic rather than a linear solution, which for larger files makes a significant difference in efficiency. There is also no need to store all the lines of the other file in a list with readlines: you can write as you iterate over the file object and do your lookups on the fly.
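A quick, informal timing sketch of that point: membership tests against a set stay essentially flat, while tests against a list scan it linearly (the sizes and names here are arbitrary):

```python
import timeit

# Build identical data as a list and as a set.
data = [f"FILENAME{i}" for i in range(10000)]
as_list = list(data)
as_set = set(data)

# Probe for an element near the end (worst case for the list scan).
probe = "FILENAME9999"
list_time = timeit.timeit(lambda: probe in as_list, number=1000)
set_time = timeit.timeit(lambda: probe in as_set, number=1000)
print(f"list: {list_time:.4f}s  set: {set_time:.4f}s")
```

On any recent Python the set lookups come out orders of magnitude faster, which is the gap the answer is describing.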

Upvotes: 0

Stefano Sanfilippo

Reputation: 33116

You need to first load all the lines in 2.txt and then filter the lines in 1.txt that contain a line from the former. Use a set or frozenset to organize the "blacklist", so that each not in runs in O(1) on average. Also note that f1 and f2 are already iterable:

with open('2.txt', 'r') as f2:
    blacklist = frozenset(line.strip() for line in f2)

with open('1.txt', 'r') as f1:
    non_duplicates = [x.strip() for x in f1 if x.split(";")[1].strip() not in blacklist]

Upvotes: 3
