Reputation: 1362
I have three txt files, each containing one record per line:
File 1 (9.7 thousand lines):
ID1, data 1
File 2 (2.1 million lines):
ID1, ID2
File 3 (1.1 thousand lines):
ID2, data 3
I want to make a file 4 that combines them: for every (ID1, ID2) pair in file 2, look up the data for ID1 in file 1 and the data for ID2 in file 3, and output ID1, ID2, DATA1, DATA2.
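For illustration (made-up values; the script below splits on single spaces, so I assume space-separated fields), the join should work like this:
file1.txt line:  A1 foo
file2.txt line:  A1 B7
file3.txt line:  B7 bar
file4 line:      A1 B7 foo bar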
I have written a Python script for this, but at the moment it takes an hour. Here is what it does:
file1 = []
file4 = []
file3 = []
file4.append("ID1, ID2, DATA1, DATA2") # header row for the output
#Import file1
with open('file1.txt') as inputfile: #file 1: around 9.7k
    for line in inputfile:
        temp = line.strip().split(' ')
        file1.append(temp)
#Import file3
with open('file3.txt') as inputfile: #file 3: around 1.1k
    for line in inputfile:
        temp = line.strip().split(' ')
        file3.append(temp)
print len(file1)
#Iterate through file2 (so I only iterate once through this)
with open('file2.txt') as inputfile: #File 2: 2.1 million
    for line in inputfile:
        temp = line.strip().split(' ')
        for sublist in file1: #Only if first element is also in file 1
            if sublist[0] == temp[0]:
                for sublist2 in file3:
                    if sublist2[0] == temp[1]:
                        file4.append([temp, sublist[1], sublist2[1]])
print len(file4)
print file4[:10]
thefile = open('final.txt', 'w')
for item in file4:
    thefile.write("%s\n" % item)
thefile.close()
As mentioned, it currently takes an hour. How can I improve performance? There is a lot of nested looping, and I suspect it could be done faster in some way.
Note: each ID appears only once; the data values can be repeated.
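For scale: the nested scan compares each of the 2.1 million lines of file 2 against up to 9.7 thousand entries of file 1, i.e. on the order of 2,100,000 × 9,700 ≈ 2 × 10^10 comparisons before the inner file 3 loop even runs, which is where the hour goes.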
Upvotes: 2
Views: 56
Reputation: 618
Since your IDs are unique, as you write, you can use dictionaries instead of lists for file1 and file3. Checking whether an ID is present then reduces to a single hash lookup (average O(1)) instead of a scan over the whole list, so you drop both inner loops over your long file entirely. One caveat: in Python 2, dict.keys() returns a list, and a membership test against a list is itself a linear scan, so test membership against the dictionary directly rather than against a list of its keys. Please try the following approach:
file1 = {} # empty new dictionary mapping ID1 -> data
file4 = []
file3 = {} # empty new dictionary mapping ID2 -> data
file4.append("ID1, ID2, DATA1, DATA2") # header row for the output
#Import file1
with open('file1.txt') as inputfile: #file 1: around 9.7k
    for line in inputfile:
        temp = line.strip().split(' ')
        file1[temp[0]] = temp[1] # store ID1 and associated data in dict
#Import file3
with open('file3.txt') as inputfile: #file 3: around 1.1k
    for line in inputfile:
        temp = line.strip().split(' ')
        file3[temp[0]] = temp[1] # store ID2 and associated data in dict
print len(file1)
#Iterate through file2 (so I only iterate once through this)
with open('file2.txt') as inputfile: #File 2: 2.1 million
    for line in inputfile:
        temp = line.strip().split(' ')
        if temp[0] in file1: # hash lookup in the dict, not a list scan
            if temp[1] in file3:
                file4.append([temp, file1[temp[0]], file3[temp[1]]])
print len(file4)
print file4[:10]
thefile = open('final.txt', 'w')
for item in file4:
    thefile.write("%s\n" % item)
thefile.close()
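As a further refinement (a minimal, untested sketch on my part, assuming the same space-separated format), you could also write each joined row as soon as it is found instead of collecting 2.1 million rows in file4 first. That keeps memory flat and produces plain space-separated output rather than the repr of a Python list:
# Build the same lookup dictionaries as above
file1 = {}
with open('file1.txt') as f:
    for line in f:
        key, value = line.strip().split(' ', 1) # keep everything after the first space as the data
        file1[key] = value
file3 = {}
with open('file3.txt') as f:
    for line in f:
        key, value = line.strip().split(' ', 1)
        file3[key] = value
# Stream the join: read file2 line by line and write matches immediately
with open('file2.txt') as inputfile, open('final.txt', 'w') as outfile:
    outfile.write("ID1, ID2, DATA1, DATA2\n") # header row
    for line in inputfile:
        id1, id2 = line.strip().split(' ') # assumes exactly two fields per line
        if id1 in file1 and id2 in file3: # two hash lookups per line
            outfile.write("%s %s %s %s\n" % (id1, id2, file1[id1], file3[id2]))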
Upvotes: 1