Neo4j Link prediction ML Pipeline

Question

I am working on a use case where I need to predict a relationship between nodes of different types. My graph looks something like this:

(:customer)-[:has]->(:session)
(:session)-[:contains]->(:order)
(:order)-[:has]->(:product)
(:order)-[:to]->(:relation)

There are many customers who have placed orders. Some of the orders specify for whom the order was intended (the relation, i.e., mother, father, etc.) and some do not. For the orders that do not, I want to predict for whom the order was most likely intended.

I have prepared a Link Prediction ML pipeline in Neo4j. The gds.beta.pipeline.linkPrediction.predict.mutate procedure has two prediction modes: exhaustive search and approximate search. The first predicts for all unconnected node pairs and the second uses KNN to choose which pairs to predict. I want neither; I want the model to predict links only between two specific node types, 'order' nodes and 'relation' nodes. How do I specify this in the predict procedure?

GKPD · Accepted Answer

You can also frame this problem as node classification and get what you are looking for. Treat Relation as the target variable and it becomes a multi-class classification problem. Let's say that Relation is a categorical variable with a few types (Mother/Father/Sibling/Friend etc.) and the hypothesis is that, based on the properties of the Customer and Order nodes, we can predict which relation a given order is intended for.

Examples of Customer node properties are age, location and billing address; examples of Order node properties are category, description, shipping address, billing address and location. Properties of Session nodes are probably not useful for predicting anything about the order or the relation it is intended for.

To run any algorithm in Neo4j, we have to project a graph into memory. Some of the properties on Customer and Order nodes are strings, and the graph projection procedures do not support projecting strings into memory. Hence, the strings have to be converted into numerical values.

For example, Customer age can be used as is, but the order description has to be converted into a word or phrase embedding using some NLP method. Some creative feature engineering also helps: instead of encoding the billing and shipping addresses themselves, a simple flag indicating whether they are the same or different makes it easier to tell whether the customer is shipping the order to their own address or somewhere else.
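
The address flag could be computed along these lines before the properties are written back to the Order nodes. This is a minimal sketch in plain Python: the dictionary keys (`billing_address`, `shipping_address`, `same_address`) and the normalization rule are assumptions for illustration, not names from the question's data model.

```python
def same_address_flag(billing: str, shipping: str) -> int:
    """Return 1 if the two addresses match after light normalization, else 0."""
    def normalize(address: str) -> str:
        # Lowercase and collapse whitespace so trivial formatting
        # differences do not count as a different address.
        return " ".join(address.lower().split())
    return int(normalize(billing) == normalize(shipping))

# Hypothetical order records; in practice these would be read from the database.
orders = [
    {"id": 1, "billing_address": "12 Oak St, Springfield",
     "shipping_address": "12 oak st,  springfield"},
    {"id": 2, "billing_address": "12 Oak St, Springfield",
     "shipping_address": "98 Elm Ave, Shelbyville"},
]

for order in orders:
    order["same_address"] = same_address_flag(
        order["billing_address"], order["shipping_address"]
    )
```

The resulting 0/1 value is numeric, so it can be projected into memory alongside properties such as age.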

Since we are using Relation as the target variable, label encode the relation type and add it as a class label property on the Order nodes that have a relationship to a Relation node (the labelled examples). For all other orders, set the class label property to 0 (or any other number that is not one of the label-encoded relation types).
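
As a rough illustration of this encoding step, the sketch below assigns integer codes starting at 1 and reserves 0 for orders without a known relation. The record layout and the names `class_label` and `UNLABELLED` are hypothetical; only the idea of a reserved placeholder value comes from the text above.

```python
UNLABELLED = 0  # placeholder class for orders with no known relation

def build_label_encoding(relation_types):
    """Map each distinct relation type to an integer starting at 1 (0 is reserved)."""
    distinct = sorted(set(relation_types))
    return {name: index for index, name in enumerate(distinct, start=1)}

# Hypothetical orders; "relation" is None where no (:order)-[:to]->(:relation) exists.
orders = [
    {"id": 1, "relation": "Mother"},
    {"id": 2, "relation": None},
    {"id": 3, "relation": "Father"},
    {"id": 4, "relation": "Mother"},
]

encoding = build_label_encoding(
    o["relation"] for o in orders if o["relation"] is not None
)

for order in orders:
    # Orders without a relation fall back to the reserved placeholder.
    order["class_label"] = encoding.get(order["relation"], UNLABELLED)
```

Starting the codes at 1 keeps the placeholder from colliding with a real relation type.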

Now, project a graph with Customer, Session and Order nodes, along with the properties of interest, into memory. Since we are not using Session nodes in the prediction task, we can collapse the path between Customer and Order nodes. One customer can connect to multiple orders via multiple Session nodes, but each order is unique, so there is only one path from a customer to any given order. Collapsing the path therefore never produces multiple relationships between the same customer and order, and no aggregation is needed.
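
The claim about uniqueness can be checked on a toy example. The sketch below is plain Python that mimics the effect of collapsing a two-hop path; it is not the GDS collapse path procedure, and the node identifiers are made up for illustration.

```python
# (:customer)-[:has]->(:session), as (customer, session) pairs
has_session = [("c1", "s1"), ("c1", "s2"), ("c2", "s3")]
# (:session)-[:contains]->(:order), as (session, order) pairs;
# each order appears under exactly one session
contains_order = [("s1", "o1"), ("s1", "o2"), ("s2", "o3"), ("s3", "o4")]

def collapse(first_hop, second_hop):
    """Return one (start, end) pair per two-hop path start -> middle -> end."""
    ends_by_middle = {}
    for middle, end in second_hop:
        ends_by_middle.setdefault(middle, []).append(end)
    return [
        (start, end)
        for start, middle in first_hop
        for end in ends_by_middle.get(middle, [])
    ]

customer_order = collapse(has_session, contains_order)

# No (customer, order) pair is repeated, because each order has a single session.
has_duplicates = len(customer_order) != len(set(customer_order))
```

If an order could appear under two sessions of the same customer, the same pair would show up twice and some aggregation would be needed.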

You can now use the node classification ML pipeline in the Neo4j GDS library: generate embeddings, use the embedding property on Order nodes as the feature vector and the class label property as the target, and train a multi-class classification model. The model predicts the class a given order belongs to, or the likelihood that the order is intended for each relation type.
