user3091644
user3091644

Reputation: 439

How to combine arrays only if they have a common value?

I have a number of pairs of values (pairs of bonded atoms) for a file containing different molecules. If two pairs have a common member, it means that they are part of the same molecule. I am trying to find an efficient way in python to group the atoms depending on which molecule they belong to.

As an example, ethane and methane would be:

1,5 and 9 would be carbon, the rest hydrogen

[[1,2],[1,3],[1,4],[1,5],[5,6],[5,7],[5,8],[9,10],[9,11],[9,12],[9,13]]

And I would like to get a list/array in which I have:

[[1,2,3,4,5,6,7,8],[9,10,11,12,13]]

I have tried several things but they are really ineficient for files with a large number of atoms. There should be a smart way of doing it but I can't find it. Any ideas?

Thanks, Joan

Upvotes: 1

Views: 92

Answers (3)

Marco Spinaci
Marco Spinaci

Reputation: 1879

If I understand correctly, what you're trying to do is to identify connected components of the graph, where each node is an atom and each edge is a bond (hence, one connected component is a molecule). There is an efficient implementation for this in scipy.sparse.csgraph.

So first let's set up the graph as a sparse matrix:

import scipy.sparse as sps

# Input as provided
edges = [[1,2],[1,3],[1,4],[1,5],[5,6],[5,7],[5,8],[9,10],[9,11],[9,12],[9,13]]
# Modify the input by adding, for each [x,y], also [y,x].
# Also transform it to a set and then again to a list
# to assure that we don't duplicate anything.
edges = list({(x[0],x[1]) for x in edges}.union({(x[1],x[0]) for x in edges}))
# Create it as a matrix. The weights of all edges are set to 1,
# as they don't matter anyway.
graph = sps.csr_matrix(([1]*len(edges), np.array(edges).T))

At this point, it's just a matter of calling scipy.sparse.csgraph.connected_components, but the output has a slightly different format by default:

(3, array([0, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2]))

So let's modify it a bit:

from scipy.sparse import csgraph
connected_components = csgraph.connected_components(graph)
result = []

for u in range(1, connected_components[0]):
    result.append(np.where(connected_components[1]==u)[0])

result

[array([1, 2, 3, 4, 5, 6, 7, 8], dtype=int64),

array([ 9, 10, 11, 12, 13], dtype=int64)]

Also remark that in range I've started from 1, because Python standard counts from 0 and this would be found as an isolated node since you start from 1. If the numbering of the atoms is non-continuous, the one needs to skip the isolated nodes, for example by doing:

result = [r for r in result if len(r) > 1]

Upvotes: 1

zipa
zipa

Reputation: 27879

Here's another approach:

a = [[1,2],[1,3],[1,4],[1,5],[5,6],[5,7],[5,8],[9,10],[9,11],[9,12],[9,13]]

result = []

for sub in a:
    join = False
    for i, r in enumerate(result):
        if any([x in r for x in sub]):
            join = True
            index = i
    if join:
        result[index] += [y for y in sub if y not in result[index]]
    else:
        result.append(sub)

result
#[[1,2,3,4,5,6,7,8],[9,10,11,12,13]]

Upvotes: 0

Adi219
Adi219

Reputation: 4814

bigArr = [[1,2],[1,3],[1,4],[1,5],[5,6],[5,7],[5,8],[9,10],[9,11],[9,12],[9,13]] ## Your list of pairs of values
molArr = []
for pair in bigArr:
    flag = False
    for molecule in molArr:
        if pair[0] in molecule or pair[1] in molecule: ## Add both values if any of them are in the molecules list
            molecule.append(pair[0])
            molecule.append(pair[1])
            flag = True ## The values have been added to an existing list

    if not flag: ## The values weren't in an existing list so add them both
        molArr.append(pair)

i = 0
for i in range(len(molArr)): ## Remove duplicates in one loop
    molArr[i] = list(set(molArr[i]))

Upvotes: 0

Related Questions