user839145

Reputation: 7371

Python loop through two files, do computation, then output 3 files

I have 2 tab-delimited files, for example:

file1:

12  23  43  34
433  435  76  76

file2:

123  324  53  65
12  457  54  32

I would like to loop through these 2 files, comparing every line of file1 with every line of file2 and vice versa. If, for example, the 1st number of the 1st line in file1 is the same as the 1st number of the 2nd line in file2, I would like to write the 1st line from file1 to a file called output. Then I would like to put all the lines from file1 that didn't find a match in file2 into a new file, and all the lines from file2 that didn't find a match in file1 into another new file.

So far I have been able to find the matching lines and put them in a file, but I'm having trouble putting the lines that didn't match into 2 separate files.

one=open(file1, 'r').readlines()
two=open(file2, 'r').readlines()
output=open('output.txt', 'w')
count=0
list1=[]    #list for lines in file1 that didn't find a match 
list2=[]    #list for lines in file2 that didn't find a match
for i in one:
    for j in two:
        columns1=i.strip().split('\t')
        num1=int(columns1[0])
        columns2=j.strip().split('\t')
        num2=int(columns2[0])
        if num1==num2:
           count+=1
           output.write(i+j)
        else:
           list1.append(i)        
           list2.append(j)

The problem I have here is with the else part. If someone could show me the right (and a better) way to do this, I would greatly appreciate it.

EDIT: Thanks for the quick responses, everyone. The 3 outputs I am looking for are:

Output_file1: #Matching results between the 2 files

12 23 43 34 #line from file1
12 457 54 32 #line from file2

Output_file2: #lines from the first file that didn't find a match

433 435 76 76

Output_file3: #lines from the second file that didn't find a match

123 324 53 65

Upvotes: 2

Views: 1813

Answers (4)

Artsiom Rudzenka

Reputation: 29093

I don't think this is the best way, but it works for me and looks pretty easy to understand:

# Sorry but was not able to check code below
def get_diff(fileObj1, fileObj2):
    f1Diff = []
    f2Diff = []
    outputData = []
    # x is one row
    f1Data = set(x.strip() for x in fileObj1)
    f2Data = set(x.strip() for x in fileObj2)
    f1Column1 = set(x.split('\t')[0] for x in f1Data)
    f2Column1 = set(x.split('\t')[0] for x in f2Data)
    # Keys present in only one file: use set difference (-), since the
    # symmetric difference (^) would give the same set in both directions
    l1Col1Diff = f1Column1 - f2Column1
    l2Col1Diff = f2Column1 - f1Column1
    commonPart = f1Column1 & f2Column1
    for line in f1Data.union(f2Data):
        lineKey = line.split('\t')[0]
        if lineKey in commonPart:
            outputData.append(line)
        elif lineKey in l1Col1Diff:
            f1Diff.append(line)
        elif lineKey in l2Col1Diff:
            f2Diff.append(line)
    return outputData, f1Diff, f2Diff

outputData, file1Missed, file2Missed = get_diff(open(file1, 'r'), open(file2, 'r'))
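Note that get_diff strips each line, so the returned strings no longer end in a newline and one has to be added back when writing them out. A minimal sketch of that last step; the helper name write_lines and the file name are hypothetical, not part of the answer above:

```python
import os
import tempfile

def write_lines(path, lines):
    """Write each string in `lines` to `path`, one per line."""
    with open(path, "w") as handle:
        for line in lines:
            # The lines were stripped earlier, so re-add the newline here
            handle.write(line + "\n")

# Small demonstration using a temporary directory
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "file1_only.txt")
write_lines(demo_path, ["433\t435\t76\t76"])
with open(demo_path) as handle:
    written = handle.read()
```

The same helper could be called once for each of the three lists that get_diff returns.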

Upvotes: 1

eugene_che

Reputation: 1997

I think that this code fits your purposes:

one=open(file1, 'r').readlines()
two=open(file2, 'r').readlines()
output=open('output.txt', 'w')

first = {x.split('\t')[0] for x in one}
second = {x.split('\t')[0] for x in two}
common = first.intersection( second )
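# List comprehensions give real lists on both Python 2 and Python 3
# (on Python 3, filter() returns a lazy iterator, so len() would fail)
list1 = [x for x in one if x.split('\t')[0] not in common]
list2 = [x for x in two if x.split('\t')[0] not in common]
res1 = [x for x in one if x.split('\t')[0] in common]
res2 = [x for x in two if x.split('\t')[0] in common]
count = len(res1)
# zip() pairs the matching lines by position and stops at the shorter list
for a, b in zip(res1, res2):
    output.write(a)
    output.write(b)
output.close()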
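The code above builds list1 and list2 but stops short of writing them anywhere. Here is a rough sketch of the whole thing producing all three files, using the sample data from the question. The function name split_by_first_column and the output file names are my own assumptions, and it is written for Python 3:

```python
import os
import tempfile

def split_by_first_column(one, two):
    """Return (matched, only_one, only_two) lists of lines, keyed on column 1."""
    first = {x.split('\t')[0] for x in one}
    second = {x.split('\t')[0] for x in two}
    common = first & second
    # Matching lines from file1 come first, then matching lines from file2
    matched = [x for x in one + two if x.split('\t')[0] in common]
    only_one = [x for x in one if x.split('\t')[0] not in common]
    only_two = [x for x in two if x.split('\t')[0] not in common]
    return matched, only_one, only_two

# Sample lines from the question, as readlines() would return them
one = ["12\t23\t43\t34\n", "433\t435\t76\t76\n"]
two = ["123\t324\t53\t65\n", "12\t457\t54\t32\n"]
matched, only_one, only_two = split_by_first_column(one, two)

# Write the three result files (hypothetical names) into a temp directory
out_dir = tempfile.mkdtemp()
outputs = [("output.txt", matched),
           ("file1_only.txt", only_one),
           ("file2_only.txt", only_two)]
for name, lines in outputs:
    with open(os.path.join(out_dir, name), "w") as handle:
        handle.writelines(lines)
```

The lines keep their original trailing newlines here, which is why writelines can be used directly.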

Upvotes: 1

Lie Ryan

Reputation: 64837

I'd suggest using set operations:

from collections import defaultdict

def parse(filename):
    result = defaultdict(list)
    for line in open(filename):
        # take the first number and put it in result
        # split() with no argument splits on any whitespace, including tabs
        num = int(line.strip().split()[0])
        result[num].append(line)  
    return result

def select(selected, items):
    result = []
    for s in selected:
        result.extend(items[s])
    return result

one = parse('one.txt')
two = parse('two.txt')
one_s = set(one)
two_s = set(two)
intersection = one_s & two_s
one_only = one_s - two_s
two_only = two_s - one_s

one_two = defaultdict(list)
for e in one: one_two[e].extend(one[e])
for e in two: one_two[e].extend(two[e])

open('intersection.txt', 'w').writelines(select(intersection, one_two))
open('one_only.txt', 'w').writelines(select(one_only, one))
open('two_only.txt', 'w').writelines(select(two_only, two))

Upvotes: 2

Pat B

Reputation: 564

I would suggest that you use the csv module to read your files, like so (you might have to experiment with the dialect; see http://docs.python.org/library/csv.html for help):

import csv
# 'excel-tab' is the built-in dialect for tab-delimited files
one = csv.reader(open(file1, 'r'), dialect='excel-tab')
two = csv.reader(open(file2, 'r'), dialect='excel-tab')

then you might find it easier to "zip" along the lines of both files at the same time like so (see http://docs.python.org/library/itertools.html#itertools.izip_longest):

import itertools
file_match = open('match', 'w')
file_nomatch1 = open('nomatch1', 'w')
file_nomatch2 = open('nomatch2', 'w')
for i,j in itertools.izip_longest(one, two, fillvalue="-"):
    if i[0] == j[0]:
        file_match.write(str(i)+'\n')
    else:
        file_nomatch1.write(str(i)+'\n')
        file_nomatch2.write(str(j)+'\n') 
        # and maybe handle the case where one is "-"

I reread the post and realized you are looking for a match between ANY two lines in both files. Maybe someone will find the above code useful, but it won't solve your particular problem.
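For the any-line-to-any-line case, one possible adaptation of the csv approach is to index the rows of the second file by their first column and look each row of the first file up in that index. This is only a sketch under my own assumptions: match_rows is a hypothetical helper, it is written for Python 3, it assumes the built-in excel-tab dialect suits the files, and io.StringIO stands in for the real files.

```python
import csv
import io
from collections import defaultdict

def match_rows(rows1, rows2):
    """Split rows into (matched, only1, only2) by comparing first columns."""
    rows1 = [row for row in rows1 if row]  # skip blank lines
    rows2 = [row for row in rows2 if row]
    # Index the second file's rows by their first column
    by_key2 = defaultdict(list)
    for row in rows2:
        by_key2[row[0]].append(row)
    keys1 = {row[0] for row in rows1}

    matched, only1 = [], []
    for row in rows1:
        if row[0] in by_key2:
            # Row from file1 followed by its matching row(s) from file2
            matched.append(row)
            matched.extend(by_key2[row[0]])
        else:
            only1.append(row)
    only2 = [row for row in rows2 if row[0] not in keys1]
    return matched, only1, only2

# In-memory stand-ins for the two sample files from the question
file1 = io.StringIO("12\t23\t43\t34\n433\t435\t76\t76\n")
file2 = io.StringIO("123\t324\t53\t65\n12\t457\t54\t32\n")
matched, only1, only2 = match_rows(
    csv.reader(file1, dialect="excel-tab"),
    csv.reader(file2, dialect="excel-tab"),
)
```

Each result is a list of rows (lists of strings), so they could be written back out with csv.writer using the same dialect. If a key appears more than once in the first file, the matching rows from the second file are repeated after each occurrence.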

Upvotes: 2
