Reputation: 333
I am working on a little project and I have a dataset of about 60k nodes and 500k relationships between those nodes. The nodes are of two types. First type are are recipes and the second type are ingredients. Recipes are composed of ingredients like:
(ingredient)-[:IS_PART_OF]->(recipe)
My objective is to find how many common ingredients two recipes share. I have managed to obtain this information with the following query that compares one recipe to all others (the first one with all others):
MATCH (recipe:RECIPE{ ID: 1000000 }),(other)
WHERE (other.ID >= 1000001 AND other.ID <= 1057690)
OPTIONAL MATCH (recipe:RECIPE)<-[:IS_PART_OF]-(ingredient:INGREDIENT)- [:IS_PART_OF]->(other)
WITH ingredient, other
RETURN other.ID, count(distinct ingredient.name)
ORDER BY other.ID DESC
My first question: How can I obtain the number of all ingredients of two recipes in a way that the mutual ones are counted only once (union of R1 and R2 --> R1 U R2)
My second question: is it possible to write a loop that would iterate through all the recipes and check for common ingredients? The objective is to compare each recipe with all others. I think this should return (n-1)*(n/2) rows.
I have tried the above and the problem remains. Even with LIMIT
and SKIP
I can not run the code on the whole set. I have changed my query so it allows me to partition my set accordingly:
MATCH (recipe1)<-[:IS_PART_OF]-(ingredient:INGREDIENT)-[:IS_PART_OF]->(recipe2)
WHERE (recipe2.ID >= 1000000 AND recipe2.ID <= 1000009) AND (recipe1.ID >= 1000000 AND recipe1.ID <= 1000009) AND (recipe1.ID < recipe2.ID)
RETURN recipe1.ID, count(distinct ingredient.name) AS MutualIngredients, recipe2.ID
ORDER BY recipe1.ID
Until I get my hands on a better machine this will suffice.
I still haven't solved my first question: how can I obtain the number of all ingredients of two recipes in a way that the mutual ones are counted only once (union of R1 and R2 --> R1 U R2)
Upvotes: 3
Views: 195
Reputation: 333
SOLVED: Just to share it if someone else will need it:
MATCH (recipe1)<-[:IS_PART_OF]-(ingredient:INGREDIENT)-[:IS_PART_OF]->(recipe2)
MATCH (recipe1)<-[:IS_PART_OF]-(ingredient1:INGREDIENT)
MATCH (recipe2)<-[:IS_PART_OF]-(ingredient2:INGREDIENT)
WHERE (recipe2.ID >= 1000000 AND recipe2.ID <= 1000009) AND (recipe1.ID >= 1000000 AND recipe1.ID <= 1000009) AND (recipe1.ID < recipe2.ID)
RETURN recipe1.ID, count(distinct ingredient1.name) + count(distinct ingredient2.name) - count(distinct ingredient.name) AS RecipesUnion, recipe2.ID
ORDER BY recipe1.ID
Upvotes: 0
Reputation: 18002
You'll need to play with this, but it's going to be something similar to this:
MATCH (recipe1:RECIPE)<-[:IS_PART_OF]-(ingred:INGREDIENT)-[:IS_PART_OF]->(recipe2:RECIPE)
WHERE ID(recipe1) < ID(recipe2)
RETURN recipe1, collect(ingred.name), recipe2
ORDER BY recipe1.ID
The match pattern gets you all of the common ingredients between two recipes. The WHERE
clause ensures that you're not comparing a recipe to itself (because it would share all ingredients with itself). The return clause just gives you the two recipes you're comparing, and what they have in common.
This will be O(n^2) though, and will be very slow.
UPDATE took Nicole's suggestion, which is a good one. That should guarantee each pair is only considered once.
Upvotes: 2