Reputation: 3363
I am investigating an algorithm for similar matches and am trying to work out if a graph database would be the best data model for my solution. Let's use "find a similar car" as an example.
If we had car data like:
Owner | Make | Model | Engine | Colour
Jeff | Ford | Focus | 1400cc | Light Red
Bob | Ford | Focus | 1800cc | Dark Red
Paul | Ford | Mondeo | 2000cc | Blue
My understanding is that a graph database would be extremely performant with queries like:
Get me all owners who own a car of the same make as Jeff
Because you would start at the 'Jeff' node, follow the 'Make' edge to the 'Ford' node, and from here follow all the 'Owner' edges to get all people that own a Ford.
Now my question is whether it would be performant to do "similar" lookups, e.g.:
Get me all owners whose car is within 500cc of Jeff
Presumably if you had "1400cc" as an Engine node, you would not be able to traverse the graph from here to find other Engines of a similar size, and so it would not be performant. My thinking is you would have to run some sort of overnight batch to create new edges between all Engine nodes, with the size difference between those two engines.
Have I understood correctly? Does a graph database seem like a good fit, or is there some other storage / retrieval / analysis method that would fit this problem better?
What about the case where I want to see the top 10 most similar cars, and my similarity algorithm is something like "Start at 100%, deduct 2% for every 100cc of difference, deduct 20% for a different model, deduct 30% for a different make, deduct 20% for a different colour (or 5% if it's a different shade of the same colour)"? The only way I can currently think of to keep the application performant is to have a background task constantly iterating through the entire dataset and creating "similarity score" edges between every pair of Owners.
Obviously with small datasets the solution doesn't really matter as any hodge-podge will be performant, but eventually we will have potentially hundreds of thousands of cars.
Any thoughts appreciated!
Upvotes: 0
Views: 117
Reputation: 67044
To get you started, here is a simple model, illustrated using sample data for "Jeff":
(make:Make {name: "Ford"})-[:MAKES]->(model:Model {name: "Focus", cc: 1400, year: 2016})
(o:Owner {name: "Jeff"})-[:OWNS]->(v:Vehicle {vin: "WVWZZZ6XZXW068123", plate: "ABC123", color: "Light Red"})-[:MODEL]->(model)
To get all owners who own a car of the same make as Jeff:
MATCH (o1:Owner { name: "Jeff" })-[:OWNS]->(:Vehicle)-[:MODEL]->(model:Model)<-[:MAKES]-(make:Make)
MATCH (make)-[:MAKES]->(:Model)<-[:MODEL]-(:Vehicle)<-[:OWNS]-(owners:Owner)
RETURN DISTINCT owners;
To get all owners whose car is within 500cc of Jeff:
// Find the model of Jeff's car; its cc is the reference engine size.
MATCH (o1:Owner { name: "Jeff" })-[:OWNS]->(:Vehicle)-[:MODEL]->(model:Model)
// Compare against models of any make, since the question does not restrict the make.
MATCH (x:Model)
WHERE (x.cc >= model.cc - 500) AND (x.cc <= model.cc + 500)
MATCH (x)<-[:MODEL]-(:Vehicle)<-[:OWNS]-(owners:Owner)
RETURN DISTINCT owners;
The above queries will be a bit faster if you first create an index on :Owner(name):
CREATE INDEX ON :Owner(name);
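For the top-10 rule described in the question, precomputed "similarity score" edges may not be needed at a few hundred thousand cars: scoring every candidate at lookup time is a single linear pass. Below is a minimal sketch of that idea in application-side Python rather than Cypher. The names `Car`, `base_colour`, `similarity` and `most_similar` are hypothetical, and it assumes shades are written as "<shade> <base>" (so "Light Red" and "Dark Red" share the base colour "red"), which the question does not actually specify.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Car:
    owner: str
    make: str
    model: str
    cc: int
    colour: str  # e.g. "Light Red"


def base_colour(colour):
    # Assumption: the last word is the base colour ("Light Red" -> "red").
    return colour.split()[-1].lower()


def similarity(a, b):
    """Score two cars using the deductions listed in the question."""
    score = 100
    score -= 2 * (abs(a.cc - b.cc) // 100)  # 2% per full 100cc of difference
    if a.make != b.make:
        score -= 30
    if a.model != b.model:
        score -= 20
    if a.colour != b.colour:
        # 5% for a different shade of the same colour, otherwise 20%.
        score -= 5 if base_colour(a.colour) == base_colour(b.colour) else 20
    return score


def most_similar(target, cars, n=10):
    """Return up to n (score, car) pairs, best first, excluding the target."""
    candidates = ((similarity(target, c), c) for c in cars if c is not target)
    return heapq.nlargest(n, candidates, key=lambda pair: pair[0])


# Sample data from the question.
cars = [
    Car("Jeff", "Ford", "Focus", 1400, "Light Red"),
    Car("Bob", "Ford", "Focus", 1800, "Dark Red"),
    Car("Paul", "Ford", "Mondeo", 2000, "Blue"),
]

for score, car in most_similar(cars[0], cars):
    print(score, car.owner)
```

With the three sample cars, Bob scores 87 (8% for the 400cc gap, 5% for the shade) and Paul scores 48. Whether the pass runs in the application or inside the database is a design choice; the point is only that the score is cheap to compute on demand.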
Upvotes: 1
Reputation: 8731
As @manonthemat said in the comments, there's no single best answer to your question, but I'll try to give you a data model that helps:
First of all, you have to know which properties need to be "the same" in your matches, as in this query:
Get me all owners who own a car of the same make as Jeff
Here, you'll want to create one node per Make, and a relationship from each car to the node for its make.
Example data model for this use case:
You can still create one node per property value, but that's not always the best choice: if a property has an effectively unlimited range of possible values (engine size, for example), you would end up creating one node per value.
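To illustrate the alternative for a numeric value such as engine size: keep it as a plain property and look it up by range over a sorted structure, instead of linking value nodes together. The sketch below uses Python's `bisect` module on the sample data from the question. It is only an analogy for what an ordered index makes possible, not a description of how any particular database implements it, and `owners_within` is a hypothetical helper name.

```python
from bisect import bisect_left, bisect_right

# (cc, owner) pairs kept sorted by engine size; sample data from the question.
engines = sorted([(1400, "Jeff"), (1800, "Bob"), (2000, "Paul")])
sizes = [cc for cc, _ in engines]


def owners_within(cc, tolerance=500):
    """Owners whose engine size lies in [cc - tolerance, cc + tolerance]."""
    lo = bisect_left(sizes, cc - tolerance)   # first index with size >= lower bound
    hi = bisect_right(sizes, cc + tolerance)  # first index with size > upper bound
    return [owner for _, owner in engines[lo:hi]]


print(owners_within(1400))
```

Each lookup costs two binary searches plus the size of the result, so no edges between individual engine sizes have to be created in advance.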
Keep in mind that graph databases are really good for data modeling, because relationships are easy to understand and use. So everything comes down to the data model, and each data model is unique. This guide should help you.
Upvotes: 0