clanofsol

Reputation: 93

Obtain pairs of nodes having exactly one relationship of a certain type which connects them to each other in Cypher

I have a graph database in Neo4j with drugs and drug-drug interactions, among other entities. In it, ()-[:IS_PARTICIPANT_IN]->() connects a drug to an interaction. I need to obtain those pairs of drugs a and b that are not involved in any :IS_PARTICIPANT_IN relationship other than the one connecting them, i.e. (a)-[:IS_PARTICIPANT_IN]->(ddi:DrugDrugInteraction)<-[:IS_PARTICIPANT_IN]-(b), with no other IS_PARTICIPANT_IN relationships involving either a or b.

For that purpose, I have tried the following Cypher query. However, it runs out of heap space (even after raising the heap to 8 GB), as the collect operations consume too much memory.

MATCH (drug1:Drug)-[r1:IS_PARTICIPANT_IN]->(ddi:DrugDrugInteraction)
MATCH (drug2:Drug)-[r2:IS_PARTICIPANT_IN]->(ddi)
WHERE drug1 <> drug2 
OPTIONAL MATCH (drug2)-[r3:IS_PARTICIPANT_IN]->(furtherDDI:DrugDrugInteraction)
WHERE furtherDDI <> ddi
WITH drug1, drug2, ddi, COLLECT(ddi) AS ddis, furtherDDI, COLLECT(furtherDDI) AS additionalDDIs
WITH drug1, drug2, ddi, COUNT(ddis) AS n1, COUNT(additionalDDIs) AS n2
WHERE n1 = 1 AND n2 = 0
RETURN drug1.name, drug2.name, ddi.name ORDER BY drug1;

How can I improve my code so as to get the desired results without exceeding the heap size limit?

Upvotes: 0

Views: 66

Answers (1)

cybersam

Reputation: 66989

This should work:

MATCH (d:Drug)
WHERE SIZE((d)-[:IS_PARTICIPANT_IN]->()) = 1
MATCH (d)-[:IS_PARTICIPANT_IN]->(ddi)
RETURN ddi.name AS ddiName, COLLECT(d.name) AS drugNames
ORDER BY drugNames[0]

The WHERE clause uses a very efficient degree check to filter for Drug nodes that have only a single outgoing IS_PARTICIPANT_IN relationship. This check is efficient because it does not have to actually get any DrugDrugInteraction nodes.
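
On Neo4j 5, SIZE() no longer accepts a pattern expression, so the same filter can be written with a COUNT { } subquery instead. The following is only a sketch, assuming Neo4j 5 and the same labels, relationship type and name properties as in the question; it is worth confirming with PROFILE that the check is still planned as a degree lookup on your version.

```cypher
// Assumes Neo4j 5+: COUNT { ... } replaces SIZE() over a pattern.
MATCH (d:Drug)
// Keep only drugs with exactly one outgoing IS_PARTICIPANT_IN relationship.
WHERE COUNT { (d)-[:IS_PARTICIPANT_IN]->() } = 1
MATCH (d)-[:IS_PARTICIPANT_IN]->(ddi)
RETURN ddi.name AS ddiName, COLLECT(d.name) AS drugNames
ORDER BY drugNames[0]
```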

After the degree check, the query performs a second MATCH to actually get the associated DrugDrugInteraction node. (I assume that the IS_PARTICIPANT_IN relationship only points at DrugDrugInteraction nodes, and have therefore omitted the label from the search pattern, for efficiency.)

The RETURN clause uses the aggregating function COLLECT to collect the Drug names for each ddi name. (I assume that ddi nodes have unique names.)

By the way, this query also works when any number of Drugs (not just 2) participate in the same DrugDrugInteraction and in no other ones. Also, if a matched DrugDrugInteraction has a related Drug that participates in other interactions, this query will not include that Drug in the result (since the query only considers d nodes that passed the initial degree check).
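
If only strict pairs are wanted, one possible variation is to keep just the interactions for which exactly two drugs passed the filter. This is a sketch rather than a tested solution: it assumes that a "pair" means exactly two qualifying drugs per interaction, it uses the same SIZE() pattern syntax as the query above (on Neo4j 5 the WHERE line would need the COUNT { } form), and the drug1 and drug2 aliases are illustrative. Grouping by the ddi node rather than by ddi.name also avoids relying on interaction names being unique.

```cypher
MATCH (d:Drug)
// Same filter as above: exactly one outgoing IS_PARTICIPANT_IN.
WHERE SIZE((d)-[:IS_PARTICIPANT_IN]->()) = 1
MATCH (d)-[:IS_PARTICIPANT_IN]->(ddi)
// Group by the interaction node itself, then keep only groups of two drugs.
WITH ddi, COLLECT(d.name) AS drugNames
WHERE SIZE(drugNames) = 2
RETURN drugNames[0] AS drug1, drugNames[1] AS drug2, ddi.name AS ddiName
ORDER BY drug1
```

Interactions where only one participant passed the filter, or where three or more did, are dropped by the SIZE(drugNames) = 2 condition.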

Upvotes: 3
