Reputation: 439
I have a number of pairs of values (pairs of bonded atoms) for a file containing different molecules. If two pairs have a common member, it means that they are part of the same molecule. I am trying to find an efficient way in python to group the atoms depending on which molecule they belong to.
As an example, ethane and methane would be:
1,5
and 9
would be carbon, the rest hydrogen
[[1,2],[1,3],[1,4],[1,5],[5,6],[5,7],[5,8],[9,10],[9,11],[9,12],[9,13]]
And I would like to get a list/array in which I have:
[[1,2,3,4,5,6,7,8],[9,10,11,12,13]]
I have tried several things but they are really ineficient for files with a large number of atoms. There should be a smart way of doing it but I can't find it. Any ideas?
Thanks, Joan
Upvotes: 1
Views: 92
Reputation: 1879
If I understand correctly, what you're trying to do is to identify connected components of the graph, where each node is an atom and each edge is a bond (hence, one connected component is a molecule). There is an efficient implementation for this in scipy.sparse.csgraph
.
So first let's set up the graph as a sparse matrix:
import scipy.sparse as sps
# Input as provided
edges = [[1,2],[1,3],[1,4],[1,5],[5,6],[5,7],[5,8],[9,10],[9,11],[9,12],[9,13]]
# Modify the input by adding, for each [x,y], also [y,x].
# Also transform it to a set and then again to a list
# to assure that we don't duplicate anything.
edges = list({(x[0],x[1]) for x in edges}.union({(x[1],x[0]) for x in edges}))
# Create it as a matrix. The weights of all edges are set to 1,
# as they don't matter anyway.
graph = sps.csr_matrix(([1]*len(edges), np.array(edges).T))
At this point, it's just a matter of calling scipy.sparse.csgraph.connected_components
, but the output has a slightly different format by default:
(3, array([0, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2]))
So let's modify it a bit:
from scipy.sparse import csgraph
connected_components = csgraph.connected_components(graph)
result = []
for u in range(1, connected_components[0]):
result.append(np.where(connected_components[1]==u)[0])
result
[array([1, 2, 3, 4, 5, 6, 7, 8], dtype=int64),
array([ 9, 10, 11, 12, 13], dtype=int64)]
Also remark that in range
I've started from 1, because Python standard counts from 0 and this would be found as an isolated node since you start from 1. If the numbering of the atoms is non-continuous, the one needs to skip the isolated nodes, for example by doing:
result = [r for r in result if len(r) > 1]
Upvotes: 1
Reputation: 27879
Here's another approach:
a = [[1,2],[1,3],[1,4],[1,5],[5,6],[5,7],[5,8],[9,10],[9,11],[9,12],[9,13]]
result = []
for sub in a:
join = False
for i, r in enumerate(result):
if any([x in r for x in sub]):
join = True
index = i
if join:
result[index] += [y for y in sub if y not in result[index]]
else:
result.append(sub)
result
#[[1,2,3,4,5,6,7,8],[9,10,11,12,13]]
Upvotes: 0
Reputation: 4814
bigArr = [[1,2],[1,3],[1,4],[1,5],[5,6],[5,7],[5,8],[9,10],[9,11],[9,12],[9,13]] ## Your list of pairs of values
molArr = []
for pair in bigArr:
flag = False
for molecule in molArr:
if pair[0] in molecule or pair[1] in molecule: ## Add both values if any of them are in the molecules list
molecule.append(pair[0])
molecule.append(pair[1])
flag = True ## The values have been added to an existing list
if not flag: ## The values weren't in an existing list so add them both
molArr.append(pair)
i = 0
for i in range(len(molArr)): ## Remove duplicates in one loop
molArr[i] = list(set(molArr[i]))
Upvotes: 0