Reputation: 23
I've got two sets of data describing atomic positions. They're in separate files that I would like to compare, the aim being to identify matching atoms by their coordinates. There will be up to 1000 or so entries. The files are of different lengths, since they describe different-sized systems, and both have the following format:
1 , 0.000000000000E+00 0.000000000000E+00
2 , 0.000000000000E+00 2.468958660000E+00
3 , 0.000000000000E+00 -2.468958660000E+00
4 , 2.138180920454E+00 -1.234479330000E+00
5 , 2.138180920454E+00 1.234479330000E+00
The first column is the entry ID; the second is a pair of x, y coordinates.
What I'd like to do is compare the coordinates in both sets of data and identify the matches and the corresponding IDs, e.g. "Entry 3 in file 1 corresponds to Entry 6 in file 2." I'll be using this information to alter the coordinate values within file 2.
I've read the files line by line, split each line into two entries, and put them into a list, but I'm a bit stumped as to how to specify the comparison, particularly telling it to compare the second entries only whilst still being able to retrieve the first entry. I'd imagine it would require looping?
Code looks like this so far:
open1 = open('./3x3supercell_coord_clean','r')
openA = open('./6x6supercell_coord_clean','r')
small_list=[]
for line in open1:
    stripped_small_line = line.strip()
    column_small = stripped_small_line.split(",")
    small_list.append(column_small)
big_list=[]
for line in openA:
    stripped_big_line = line.strip()
    column_big = stripped_big_line.split(",")
    big_list.append(column_big)
print(small_list[2][1]) #prints out coords only
Upvotes: 2
Views: 1644
Reputation: 3417
Here's an approach that uses dictionaries:
coords = {}
with open('first.txt', 'r') as first_list:
    for i in first_list:
        # split() with no argument drops all whitespace, including the trailing newline
        pair = i.split()
        coords[','.join(pair[2:4])] = pair[0]
        # reformatted coords used as key "2.138180920454E+00,-1.234479330000E+00"
with open('second.txt', 'r') as second_list:
    for i in second_list:
        pair = i.split()
        # reformatted coords from second list checked for presence in keys of dictionary
        if ','.join(pair[2:4]) in coords:
            print('%s %s' % (coords[','.join(pair[2:4])], pair[0]))
What's going on here is that each coordinate pair from file A (which you have stated will be distinct) gets stored in a dictionary as a key, with its ID as the value. Then the first file is closed and the second file is opened. Each coordinate pair from the second file is read, reformatted to match how the dictionary keys were saved, and checked for membership. If the coordinate string from list B is in the dictionary coords, the pair exists in both lists, and the code prints the matching IDs from the first and second list.
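Dictionary lookups are much faster than scanning a list: O(1) on average per lookup. This approach also has the advantage of needing only one of the two lists in memory, and of avoiding type-casting (e.g., float/int conversions), since the coordinates are compared as strings. The flip side is that a match requires the coordinate text to be formatted identically in both files.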
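Since the matches are meant to be used afterwards to alter file 2, the same dictionary lookup can be wrapped in a function that returns the ID pairs instead of printing them. This is only a sketch: the name `match_ids` and the sample lines are made up for illustration, and it still assumes both files write identical coordinates as identical text.

```python
def match_ids(lines_a, lines_b):
    """Return (id_in_a, id_in_b) pairs whose coordinate text is identical."""
    coords = {}
    for line in lines_a:
        fields = line.split()          # ['1', ',', x, y]
        if len(fields) < 4:
            continue                   # skip blank or malformed lines
        coords[(fields[2], fields[3])] = fields[0]
    matches = []
    for line in lines_b:
        fields = line.split()
        if len(fields) < 4:
            continue
        key = (fields[2], fields[3])
        if key in coords:
            matches.append((coords[key], fields[0]))
    return matches


# Hypothetical sample data in the question's format
file_a = [
    "1 , 0.000000000000E+00 0.000000000000E+00",
    "2 , 0.000000000000E+00 2.468958660000E+00",
    "3 , 0.000000000000E+00 -2.468958660000E+00",
]
file_b = [
    "1 , 2.138180920454E+00 1.234479330000E+00",
    "2 , 0.000000000000E+00 -2.468958660000E+00",
    "3 , 0.000000000000E+00 0.000000000000E+00",
]
pairs = match_ids(file_a, file_b)
```

Passing open file objects instead of lists should work the same way, since the function only iterates over lines.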
Upvotes: 0
Reputation: 5467
Build two dictionaries the following way:
# do your splitting to populate two dictionaries of this format:
# mydata1[Coordinate] = ID
# i.e. (assuming data1 and data2 hold each file's contents as a string)
mydata1 = {}
for line in data1.splitlines():
    fields = line.split()
    coord = fields[2] + ' ' + fields[3]
    mydata1[coord] = fields[0]
mydata2 = {}
for line in data2.splitlines():
    fields = line.split()
    coord = fields[2] + ' ' + fields[3]
    mydata2[coord] = fields[0]
#then we can use set intersection to find all coordinates in both key sets
set1=set(mydata1.keys())
set2=set(mydata2.keys())
intersect = set1.intersection(set2)
for coordinate in intersect:
    # look the IDs up in the dictionaries; the sets only hold the coordinates
    print(' '.join(["Coordinate", str(coordinate), "found in set1 id", mydata1[coordinate], "and set2 id", mydata2[coordinate]]))
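In Python 3, dictionary key views already behave like sets, so the intersection can be taken without building separate set objects first. A short sketch, with made-up dictionaries standing in for `mydata1` and `mydata2`:

```python
# Hypothetical stand-ins for the coordinate -> ID dictionaries
mydata1 = {"0.0E+00 0.0E+00": "1", "0.0E+00 2.5E+00": "2"}
mydata2 = {"0.0E+00 2.5E+00": "7", "1.0E+00 1.0E+00": "8"}

# dict.keys() returns a set-like view, so & yields the shared coordinates
common = mydata1.keys() & mydata2.keys()

# Map each shared coordinate to its (ID in file 1, ID in file 2) pair
id_pairs = {coord: (mydata1[coord], mydata2[coord]) for coord in common}
```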
Upvotes: 0
Reputation:
Use a dictionary with coordinates as keys.
data1 = """1 , 0.000000000000E+00 0.000000000000E+00
2 , 0.000000000000E+00 2.468958660000E+00
3 , 0.000000000000E+00 -2.468958660000E+00
4 , 2.138180920454E+00 -1.234479330000E+00
5 , 2.138180920454E+00 1.234479330000E+00"""
# Read data1 into a list of tuples (id, x, y)
coords1 = [(int(line[0]), float(line[2]), float(line[3])) for line in
           (line.split() for line in data1.split("\n"))]
# This dictionary will map (x, y) -> id
coordsToIds = {}
# Add coords1 to this dictionary.
for id, x, y in coords1:
    coordsToIds[(x, y)] = id
# Read coords2 the same way.
# Left as an exercise to the reader.
# Look up each of coords2 in the dictionary.
for id, x, y in coords2:
    if (x, y) in coordsToIds:
        print(coordsToIds[(x, y)])  # the ID in coords1
Beware that comparing floats for exact equality is fragile: two values that differ only in the last digit will not match.
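One way to soften the float problem is to round each coordinate before using it as a dictionary key, so that values differing only by tiny noise land on the same key. This is a sketch under assumptions: the helper name `rounded_key`, the sample tuples, and the choice of 6 decimal places are all illustrative, and two values sitting on either side of a rounding boundary can still fail to match.

```python
def rounded_key(x, y, ndigits=6):
    """Build a dictionary key from coordinates rounded to ndigits places."""
    return (round(x, ndigits), round(y, ndigits))


# Hypothetical (id, x, y) tuples; the second file carries small numerical noise
coords1 = [(1, 0.0, 2.46895866), (2, 2.138180920454, -1.23447933)]
coords2 = [(10, 2.1381809204541, -1.2344793300001), (11, 5.0, 5.0)]

# Map rounded (x, y) -> id for the first file
coords_to_ids = {rounded_key(x, y): id1 for id1, x, y in coords1}

# Look up each rounded coordinate from the second file
matches = []
for id2, x, y in coords2:
    key = rounded_key(x, y)
    if key in coords_to_ids:
        matches.append((coords_to_ids[key], id2))
```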
Upvotes: 2
Reputation: 2049
If all you are trying to do is compare the second element of each entry in the two lists, you can compare each coordinate against every coordinate in the other file. This is definitely not the fastest way to go about it (every pair gets checked), but it should get you the results you need. It scans through small_list and checks every small_entry[1] (the coordinates) against the coordinates of every entry in big_list:
for small_entry in small_list:
    for big_entry in big_list:
        # strip() removes the spaces left around the comma by split(",")
        if small_entry[1].strip() == big_entry[1].strip():
            print(small_entry[0].strip() + " matches " + big_entry[0].strip())
something like this?
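If the coordinates are parsed into numbers, the same pairwise scan can compare with a tolerance rather than exact string equality. The following is a rough sketch, not a tuned solution: the `parse` helper, the sample lines, and the tolerance `TOL` are assumptions for illustration.

```python
import math


def parse(line):
    """Split 'id , x y' into (id, x, y) with numeric coordinates."""
    entry_id, coords = line.split(",")
    x, y = coords.split()
    return entry_id.strip(), float(x), float(y)


# Hypothetical sample lines in the question's format
small = [parse(s) for s in ["1 , 0.0E+00 2.46895866E+00",
                            "2 , 2.138180920454E+00 -1.23447933E+00"]]
big = [parse(s) for s in ["5 , 2.138180920455E+00 -1.23447933E+00",
                          "6 , 9.0E+00 9.0E+00"]]

TOL = 1e-9  # assumed absolute tolerance; needed for coordinates at or near zero

# Check every small entry against every big entry (quadratic, like the loop above)
matches = [
    (s_id, b_id)
    for s_id, sx, sy in small
    for b_id, bx, by in big
    if math.isclose(sx, bx, abs_tol=TOL) and math.isclose(sy, by, abs_tol=TOL)
]
```

With around 1000 entries per file this is on the order of a million comparisons, which is usually still quick enough for a one-off script.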
Upvotes: 1