Reputation: 23

Identifying coordinate matches from two files using python

I have two sets of data describing atomic positions, stored in separate files. I would like to compare them and identify matching atoms by their coordinates. Each file holds up to 1000 or so entries; the files differ in length because they describe systems of different sizes. Both use the following format:

   1   ,    0.000000000000E+00  0.000000000000E+00    
   2   ,   0.000000000000E+00  2.468958660000E+00  
   3   ,    0.000000000000E+00 -2.468958660000E+00  
   4   ,   2.138180920454E+00 -1.234479330000E+00  
   5   ,    2.138180920454E+00  1.234479330000E+00

The first column is the entry ID; the second holds the x and y coordinates.

What I'd like to do is compare the coordinates in both sets of data and identify the matches and their corresponding IDs, e.g. "Entry 3 in file 1 corresponds to Entry 6 in file 2." I'll be using this information to alter the coordinate values within file 2.

I've read the files line by line, split each line into two entries using the split command, and put the results into a list, but I'm a bit stumped as to how to specify the comparison: in particular, how to compare only the second entries while still being able to retrieve the first. I imagine it would require looping?

Code looks like this so far:

open1 = open('./3x3supercell_coord_clean','r')
openA = open('./6x6supercell_coord_clean','r')

small_list=[]

for line in open1:
    stripped_small_line = line.strip()
    column_small = stripped_small_line.split(",") 
    small_list.append(column_small)

big_list=[]

for line in openA:
    stripped_big_line = line.strip()
    column_big = stripped_big_line.split(",")
    big_list.append(column_big)

print small_list[2][1] #prints out coords only

Upvotes: 2

Answers (4)

hexparrot

Reputation: 3417

Here's an approach that uses dictionaries:

coords = {}

with open('first.txt', 'r') as first_list:
    for i in first_list:
        pair = i.split()  # e.g. ['4', ',', '2.138180920454E+00', '-1.234479330000E+00']
        if len(pair) < 4:
            continue  # skip blank or malformed lines
        # reformatted coords used as key "2.138180920454E+00,-1.234479330000E+00"
        coords[','.join(pair[2:4])] = pair[0]

with open('second.txt', 'r') as second_list:
    for i in second_list:
        pair = i.split()
        if len(pair) < 4:
            continue
        key = ','.join(pair[2:4])
        if key in coords:
            # reformatted coords from second list checked for presence in keys of dictionary
            print("%s %s" % (coords[key], pair[0]))

What's going on here is that each coordinate pair from file A (which you have stated will be distinct) gets stored in a dictionary as a key, with its ID as the value. Then the first file is closed and the second file is opened. Each coordinate pair from the second file is reformatted to match how the dictionary keys were saved and is checked for membership. If the coordinate string from file B is in the dictionary coords, the pair exists in both files, and the IDs from the first and second files are printed for that match.

Dictionary lookups are much faster than scanning a list: O(1) on average. This approach also has the advantage of needing only one of the two lists in memory, and of avoiding type-casting such as float/int conversions, because the coordinates are compared as strings (which means both files must format them identically).
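As a minimal sketch of the same idea on in-memory lines rather than files, the matching can be wrapped in a function. The name match_ids and the sample lists are made up for illustration, and the line format is assumed to be the one shown in the question (ID, comma, x, y).

```python
def match_ids(lines_a, lines_b):
    """Return (id_a, id_b) pairs whose coordinate strings are identical."""
    coords = {}
    for line in lines_a:
        fields = line.split()          # ['ID', ',', 'x', 'y']
        if len(fields) < 4:
            continue                   # skip blank or malformed lines
        coords[(fields[2], fields[3])] = fields[0]

    matches = []
    for line in lines_b:
        fields = line.split()
        if len(fields) < 4:
            continue
        key = (fields[2], fields[3])
        if key in coords:              # average O(1) dictionary lookup
            matches.append((coords[key], fields[0]))
    return matches


# Hypothetical sample data in the question's format.
file_a = [
    "   1   ,    0.000000000000E+00  0.000000000000E+00    ",
    "   2   ,   0.000000000000E+00  2.468958660000E+00  ",
]
file_b = [
    "   1   ,   2.138180920454E+00 -1.234479330000E+00  ",
    "   6   ,    0.000000000000E+00  2.468958660000E+00",
]
print(match_ids(file_a, file_b))
```

Because the keys are strings, this only finds matches when both files print a coordinate with exactly the same digits.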

Upvotes: 0

UpAndAdam

Reputation: 5467

Build two dictionaries the following way:

# do your splitting to populate two dictionaries of this format:
# mydata1[Coordinate] = ID

# i.e. (assuming data1 and data2 each hold the full text of one file)
mydata1 = {}
mydata2 = {}
for line in data1.splitlines():
    fields = line.split()  # ['ID', ',', 'x', 'y']
    coord = fields[2] + ' ' + fields[3]
    mydata1[coord] = fields[0]
for line in data2.splitlines():
    fields = line.split()
    coord = fields[2] + ' ' + fields[3]
    mydata2[coord] = fields[0]


# then we can use set intersection to find all coordinates in both key sets
set1 = set(mydata1.keys())
set2 = set(mydata2.keys())
intersect = set1.intersection(set2)

for coordinate in intersect:
    print(' '.join(["Coordinate", str(coordinate), "found in set1 id", mydata1[coordinate], "and set2 id", mydata2[coordinate]]))
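
In Python 3, dictionary key views already support set operations, so the intermediate sets are optional. A small sketch, assuming Python 3 and two made-up dictionaries in the coordinate-to-ID format described above:

```python
# Hypothetical dictionaries mapping "x y" coordinate strings to IDs.
ids_a = {
    "0.000000000000E+00 2.468958660000E+00": "2",
    "0.000000000000E+00 -2.468958660000E+00": "3",
}
ids_b = {
    "0.000000000000E+00 -2.468958660000E+00": "6",
    "2.138180920454E+00 1.234479330000E+00": "5",
}

# keys() returns a set-like view in Python 3, so & intersects directly.
common = ids_a.keys() & ids_b.keys()

# Sort for a stable order, since sets are unordered.
pairs = sorted((ids_a[c], ids_b[c]) for c in common)
print(pairs)
```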

Upvotes: 0

user1907906

Reputation:

Use a dictionary with coordinates as keys.

data1 = """1   ,    0.000000000000E+00  0.000000000000E+00    
   2   ,   0.000000000000E+00  2.468958660000E+00  
   3   ,    0.000000000000E+00 -2.468958660000E+00  
   4   ,   2.138180920454E+00 -1.234479330000E+00  
   5   ,    2.138180920454E+00  1.234479330000E+00"""

# Read data1 into a list of tuples (id, x, y)
coords1 = [(int(line[0]), float(line[2]), float(line[3])) for line in
           (line.split() for line in data1.split("\n"))]

# This dictionary will map (x, y) -> id
coordsToIds = {}

# Add coords1 to this dictionary.
for id, x, y in coords1:
    coordsToIds[(x, y)] = id

# Read coords2 the same way.
# Left as an exercise to the reader.

# Look up each of coords2 in the dictionary.
for id, x, y in coords2:
    if (x, y) in coordsToIds:
        print(coordsToIds[(x, y)])  # the ID in coords1

Beware that comparing floats for exact equality is fragile: coordinates that differ only in their last printed digit will not match.
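
One way to soften the float problem, sketched here under assumptions, is to round each coordinate to a fixed number of decimals before using it as a key. The helper name rounded_key, the sample tuples, and the choice of 6 decimals are illustrative only, and values that straddle a rounding boundary can still end up with different keys.

```python
def rounded_key(x, y, decimals=6):
    """Build a dictionary key from coordinates rounded to `decimals` places."""
    return (round(x, decimals), round(y, decimals))


# Hypothetical (id, x, y) tuples that differ only far past the 6th decimal.
coords1 = [(4, 2.138180920454, -1.234479330000)]
coords2 = [(9, 2.138180920500, -1.234479329999)]

coords_to_ids = {rounded_key(x, y): id1 for id1, x, y in coords1}

matches = []
for id2, x, y in coords2:
    key = rounded_key(x, y)
    if key in coords_to_ids:
        matches.append((coords_to_ids[key], id2))

print(matches)
```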

Upvotes: 2

Tadgh

Reputation: 2049

If all you are trying to do is compare the second element of each entry in the two lists, you can compare each coordinate against every coordinate in the other file. This is definitely not the fastest way to go about it, but it should get you the results you need. It scans through small_list and checks every small_entry[1] (the coordinate) against the coordinate of every entry in big_list.

for small_entry in small_list:
    for big_entry in big_list:
        # split() drops the varying whitespace around the two numbers
        if small_entry[1].split() == big_entry[1].split():
            print(small_entry[0].strip() + " matches " + big_entry[0].strip())

something like this?
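
If the coordinates are parsed to floats first, the same nested loop can compare by distance instead of by exact text. This is only a sketch: the function name close_pairs, the tolerance of 1e-6, and the sample values are assumptions, not something taken from the question.

```python
import math


def close_pairs(small, big, tol=1e-6):
    """Return (small_id, big_id) for entries within `tol` of each other in x-y."""
    pairs = []
    for id_s, xs, ys in small:
        for id_b, xb, yb in big:
            # Euclidean distance between the two points
            if math.hypot(xs - xb, ys - yb) <= tol:
                pairs.append((id_s, id_b))
    return pairs


# Hypothetical (id, x, y) entries; entry 3 and entry 6 differ by about 1e-9 in y.
small = [(3, 0.0, -2.46895866), (4, 2.138180920454, -1.23447933)]
big = [(6, 0.0, -2.468958661), (7, 4.276361840908, 0.0)]
print(close_pairs(small, big))
```

With around 1000 entries per file this is on the order of a million comparisons, which is usually still acceptable for a one-off script.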

Upvotes: 1
