Reputation: 1362
I have three txt files, each containing one record per line:
File 1 (9.7 thousand lines):
ID1, data 1
File 2 (2.1 million lines):
ID1, ID2
File 3 (1.1 thousand lines):
ID2, data 3
I want to make a file 4 that combines them: for every (ID1, ID2) pair in file 2, look up the data for ID1 in file 1 and the data for ID2 in file 3, and output ID1, ID2, DATA1, DATA2.
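For illustration (made-up values; the script below splits on single spaces, so I assume space-separated fields), the join should work like this:
file1.txt line:  A1 foo
file2.txt line:  A1 B7
file3.txt line:  B7 bar
file4 line:      A1 B7 foo bar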
I have written a Python script for this, but at the moment it takes an hour. Here is what it does:
file1 = []
file4 = []
file3 = []
file4.append("ID1, ID2, DATA1, DATA2") # header row for the output
#Import file1
with open('file1.txt') as inputfile: #file 1: around 9.7k
    for line in inputfile:
        temp = line.strip().split(' ')
        file1.append(temp)
#Import file3
with open('file3.txt') as inputfile: #file 3: around 1.1k
    for line in inputfile:
        temp = line.strip().split(' ')
        file3.append(temp)
print len(file1)
#Iterate through file2 (so I only iterate once through this)
with open('file2.txt') as inputfile: #File 2: 2.1 million
    for line in inputfile:
        temp = line.strip().split(' ')
        for sublist in file1: #Only if first element is also in file 1
            if sublist[0] == temp[0]:
                for sublist2 in file3:
                    if sublist2[0] == temp[1]:
                        file4.append([temp, sublist[1], sublist2[1]])
print len(file4)
print file4[:10]
thefile = open('final.txt', 'w')
for item in file4:
    thefile.write("%s\n" % item)
thefile.close()
As mentioned, it currently takes an hour. How can I improve performance? There is a lot of nested looping, and I suspect it could be done faster in some way.
Note: each ID appears only once; the data values can be repeated.
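For scale: the nested scan compares each of the 2.1 million lines of file 2 against up to 9.7 thousand entries of file 1, i.e. on the order of 2,100,000 × 9,700 ≈ 2 × 10^10 comparisons before the inner file 3 loop even runs, which is where the hour goes.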
Upvotes: 2
Views: 56
Reputation: 618
Since your IDs are unique, as you write, you can use dictionaries instead of lists for file1 and file3. Checking whether an ID is present then reduces to a single hash lookup (average O(1)) instead of a scan over the whole list, so you drop both inner loops over your long file entirely. One caveat: in Python 2, dict.keys() returns a list, and a membership test against a list is itself a linear scan, so test membership against the dictionary directly rather than against a list of its keys. Please try the following approach:
file1 = {} # empty new dictionary mapping ID1 -> data
file4 = []
file3 = {} # empty new dictionary mapping ID2 -> data
file4.append("ID1, ID2, DATA1, DATA2") # header row for the output
#Import file1
with open('file1.txt') as inputfile: #file 1: around 9.7k
    for line in inputfile:
        temp = line.strip().split(' ')
        file1[temp[0]] = temp[1] # store ID1 and associated data in dict
#Import file3
with open('file3.txt') as inputfile: #file 3: around 1.1k
    for line in inputfile:
        temp = line.strip().split(' ')
        file3[temp[0]] = temp[1] # store ID2 and associated data in dict
print len(file1)
#Iterate through file2 (so I only iterate once through this)
with open('file2.txt') as inputfile: #File 2: 2.1 million
    for line in inputfile:
        temp = line.strip().split(' ')
        if temp[0] in file1: # hash lookup in the dict, not a list scan
            if temp[1] in file3:
                file4.append([temp, file1[temp[0]], file3[temp[1]]])
print len(file4)
print file4[:10]
thefile = open('final.txt', 'w')
for item in file4:
    thefile.write("%s\n" % item)
thefile.close()
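As a further refinement (a minimal, untested sketch on my part, assuming the same space-separated format), you could also write each joined row as soon as it is found instead of collecting 2.1 million rows in file4 first. That keeps memory flat and produces plain space-separated output rather than the repr of a Python list:
# Build the same lookup dictionaries as above
file1 = {}
with open('file1.txt') as f:
    for line in f:
        key, value = line.strip().split(' ', 1) # keep everything after the first space as the data
        file1[key] = value
file3 = {}
with open('file3.txt') as f:
    for line in f:
        key, value = line.strip().split(' ', 1)
        file3[key] = value
# Stream the join: read file2 line by line and write matches immediately
with open('file2.txt') as inputfile, open('final.txt', 'w') as outfile:
    outfile.write("ID1, ID2, DATA1, DATA2\n") # header row
    for line in inputfile:
        id1, id2 = line.strip().split(' ') # assumes exactly two fields per line
        if id1 in file1 and id2 in file3: # two hash lookups per line
            outfile.write("%s %s %s %s\n" % (id1, id2, file1[id1], file3[id2]))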
Upvotes: 1