Reputation: 21
I have two files with many columns of information about a set of objects, each identified by an object ID. I need to find matches between the two files, but the object IDs come in two different formats:
12-12-1 in one file will be written as 0012 00012 1 in the other. For instance, in one file I have:
0001 01531 1
0001 01535 1
0001 01538 1
Which corresponds to this in the other:
1-1531-1
1-1535-1
1-1538-1
Something as simple as
    matches = open('matches.dat','w')
    for j in range(len(file1)):
        for i in range(len(file2)):
            if file1[j] == file2[i]:
                matches.write('{}/n'.format(file1[j]))
doesn't seem to do the trick.
file1 and file2 here are lists that contain all the object IDs from the different files.
What do I add to my code to find the matches?
Upvotes: 2
Views: 52
Reputation: 43
A few notes:
- You don't close your matches file at the end of your code. Using with will automatically take care of file cleanup.
- Your newline character in the last line of your code isn't escaped properly: it's \n, not /n.
If your numeric formatting is always constant (i.e. the first field is always zero-padded to four digits, the second to five digits, and the last is never padded), this should work:
    with open('matches.dat', 'w') as matches:
        for j in range(len(file1)):
            for i in range(len(file2)):
                # Rebuild the padded form of the dashed ID, e.g. '1-1531-1' -> '0001 01531 1'
                match_list = file2[i].split('-')
                match_str = '{} {} {}'.format(match_list[0].zfill(4), match_list[1].zfill(5), match_list[2])
                if file1[j] == match_str:
                    matches.write('{}\n'.format(file1[j]))
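As a quick check of the padding idea, here is a minimal sketch run on the sample IDs from the question. The helper name `to_padded` and the extra non-matching ID `2-7-1` are made up for illustration, and the same fixed-width assumption applies.

```python
def to_padded(dashed_id):
    """Convert '1-1531-1' to '0001 01531 1' (hypothetical helper)."""
    first, second, third = dashed_id.strip().split('-')
    # Pad the first field to 4 digits and the second to 5; leave the last as-is.
    return '{} {} {}'.format(first.zfill(4), second.zfill(5), third)

# Sample IDs from the question, plus one invented non-matching entry.
file1 = ['0001 01531 1', '0001 01535 1', '0001 01538 1']
file2 = ['1-1531-1', '1-1535-1', '2-7-1']

# Convert file2 once, instead of re-splitting inside the inner loop.
padded_file2 = [to_padded(item) for item in file2]
matched = [item for item in file1 if item in padded_file2]
print(matched)  # ['0001 01531 1', '0001 01535 1']
```

Converting file2 up front keeps the comparison itself a plain string equality, which is what the original loop was attempting.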
Upvotes: 0
Reputation: 2497
    import re

    def convert(word):
        # '0001 01531 1' -> '0001-01531-1'
        word = word.strip().replace(' ', '-')
        # Strip leading zeros that follow a word boundary (start of string or a hyphen)
        return re.sub(r'\b0+', '', word)
You can then compute the intersection in O(n+m) average time by converting both ID lists to sets and intersecting them:
    file1_ids = {convert(line) for line in file1}
    # Strip whitespace/newlines so both sets hold IDs in the same clean form
    file2_ids = {line.strip() for line in file2}
    matches = file1_ids.intersection(file2_ids)
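To see the set approach end to end, here is a small self-contained sketch. The sample lists (including the trailing newlines and the non-matching ID `3-44-2`) are invented for illustration; `convert` is repeated so the snippet runs on its own.

```python
import re

def convert(word):
    # Same normalization as above: spaces to hyphens, then drop leading zeros.
    word = word.strip().replace(' ', '-')
    return re.sub(r'\b0+', '', word)

# Hypothetical sample data, as if read line by line from the two files.
file1 = ['0001 01531 1\n', '0001 01535 1\n', '0012 00012 1\n']
file2 = ['1-1531-1\n', '12-12-1\n', '3-44-2\n']

file1_ids = {convert(line) for line in file1}
file2_ids = {line.strip() for line in file2}

# Sets are unordered, so sort the result for stable output.
matches = sorted(file1_ids & file2_ids)
print(matches)  # ['1-1531-1', '12-12-1']
```

One caveat with this normalization: a field made up entirely of zeros (for example `0000`) would be reduced to an empty string, so it only suits IDs whose fields are never zero.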
Upvotes: 1