user19914532

Read and compare data from multiple files

I'm trying to write a function that accepts a list of filenames, opens two files at a time, and compares their values. If a value matches, it is added to a set, and the set is printed at the end. The problem is that the files have some matching values, but the function prints an empty set.

def cross_reference(files):
    set_of_users = set()
    n = len(files)
    files = cycle(files)
    for index in range(n):
        with open(next(files), mode='r') as read_file:
            with open(next(files), mode='r') as read_file1:
                for contact in read_file:
                    for contact1 in read_file1:
                        if contact == contact1:
                            set_of_users.add(contact)
                            break
    print(set_of_users)

The files having values are:

file1.txt:

0709-12345
0724-87234
0723-67890
0721-16273

file2.txt:

0709-87263
0743-76346
0724-87234
0777-89264

file3.txt:

0724-87234
0743-87469
0709-12398
0709-78548

0724-87234 is common to all three files, but it is not added to the set.

Upvotes: 0

Views: 98

Answers (3)

tripleee

Reputation: 189689

Your code seems unnecessarily complex. Unless you have a specific reason for that complexity, something simpler like this would make more sense:

from collections import defaultdict
import glob

def cross_reference(files):
    # Map each value to the set of files in which it appears.
    seen = defaultdict(set)
    for file in files:
        with open(file) as lines:
            for line in lines:
                seen[line.rstrip('\n')].add(file)
    # Print the values that were seen in every file.
    all_files = set(files)
    for item, found_in in seen.items():
        if found_in == all_files:
            print(item)

cross_reference(glob.glob('file[1-3].txt'))

Upvotes: 0

sunnytown

Reputation: 1996

One reason might be that you are not stripping the contents of each line. So one line could be '0724-87234\n' and the other could be '0724-87234' without '\n'.
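To illustrate the mismatch, here is a minimal sketch. The two string literals are just stand-ins for lines read from your files; the variable names are made up for the example.

```python
# A line read from the middle of a file keeps its trailing newline,
# while the last line of a file may have none.
with_newline = "0724-87234\n"
without_newline = "0724-87234"

# Comparing the raw strings fails; comparing the stripped strings succeeds.
raw_equal = with_newline == without_newline
stripped_equal = with_newline.rstrip("\n") == without_newline.rstrip("\n")

print(raw_equal, stripped_equal)  # False True
```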

I would go for another approach like this:

def cross_reference(files):
    common_values = None
    for file in files:
        with open(file) as f:
            values = set([line.rstrip() for line in f.readlines()])
            if common_values is None:
                common_values = values
            else:
                common_values = common_values & values
    return common_values

Instead of looping over the range of the length of files, you can simply loop over the files themselves. For each file, you read all the lines into a list and apply the rstrip() method to each line to remove the trailing '\n' (and any other trailing whitespace). This is done with a list comprehension, and the resulting list is then turned into a set.

With sets, you can easily get the common values of two sets with set1 & set2. Here, common_values starts out as None. In the loop, I check whether it is still None, and if so, I assign it the set from the first file. For each later file, I apply the & operation between the new file's set and common_values. So after each new file, common_values contains only the lines that are present in all files checked up to that point.
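As a rough, self-contained check of this idea, the sketch below writes the three sample lists from the question into temporary files and intersects them. The name common_lines and the use of tempfile are my own choices for the demonstration, and the last line of each file is deliberately written without a trailing newline.

```python
import os
import tempfile

def common_lines(paths):
    """Return the set of non-empty lines present in every file in paths."""
    common = None
    for path in paths:
        with open(path) as f:
            # Strip whitespace so 'x\n' and 'x' compare equal; skip blank lines.
            values = {line.strip() for line in f if line.strip()}
        common = values if common is None else common & values
    return common if common is not None else set()

# Sample data copied from the question.
samples = [
    ["0709-12345", "0724-87234", "0723-67890", "0721-16273"],
    ["0709-87263", "0743-76346", "0724-87234", "0777-89264"],
    ["0724-87234", "0743-87469", "0709-12398", "0709-78548"],
]

with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i, lines in enumerate(samples, start=1):
        path = os.path.join(tmp, f"file{i}.txt")
        with open(path, "w") as f:
            f.write("\n".join(lines))  # no trailing newline after the last line
        paths.append(path)
    result = common_lines(paths)

print(result)  # {'0724-87234'}
```

Returning the set instead of printing it leaves it up to the caller what to do with the result.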

Upvotes: 1

Yevhen Kuzmovych

Reputation: 12140

for contact1 in read_file1: consumes read_file1 during the first iteration of for contact in read_file:, so all subsequent iterations have nothing left to read from the second file. You can preload the lines of each file once with readlines() before iterating, and then loop over the resulting lists:

from itertools import cycle

def cross_reference(files):
    set_of_users = set()
    n = len(files)
    files = cycle(files)
    for index in range(n):
        # Read both files fully so the inner loop can be repeated.
        with open(next(files), mode='r') as read_file:
            lines1 = read_file.readlines()
        with open(next(files), mode='r') as read_file1:
            lines2 = read_file1.readlines()

        for contact in lines1:
            for contact1 in lines2:
                if contact == contact1:
                    set_of_users.add(contact)
                    break
    print(set_of_users)
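The exhaustion itself can be seen in a small sketch. It uses io.StringIO as a stand-in for an open text file (an assumption made for illustration; iterating over a real file object behaves the same way).

```python
import io

# io.StringIO stands in for an open text file (illustrative assumption).
inner = io.StringIO("a\nb\nc\n")

passes = []
for outer_item in ["x", "y"]:
    # The first pass reads every line; after that the iterator is exhausted.
    passes.append([line.rstrip("\n") for line in inner])

print(passes)  # [['a', 'b', 'c'], []]

# Rewinding with seek(0) makes the lines available again,
# much like preloading them with readlines() avoids the problem.
inner.seek(0)
again = [line.rstrip("\n") for line in inner]
print(again)  # ['a', 'b', 'c']
```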

Upvotes: 0
