Cypher query: group to get top n, then loop over all elements and find the shortest path to a member of the top n

Question

I'm trying to reduce the dimensionality for a large table (billions of records) containing pairs of values. Each record consists of a cookie and a URL, e.g.

cookie                                  url
000006A0-AC89-4D96-8091-AAA16A7990CB    www.cnn.com
000006A0-AC89-4D96-8091-AAA16A7990CB    www.bbc.co.uk
000006A0-AC89-4D96-8091-AAA16A7990CB    www.ynetnews.com
00001490-7A5C-4944-B556-0BD5DF8A262A    www.webmd.com
00001490-7A5C-4944-B556-0BD5DF8A262A    www.health.com
00001490-7A5C-4944-B556-0BD5DF8A262A    www.juicingforperfecthealth.com
...                                     ...

Most of the dimensionality reduction algorithms are designed to work on numerical data. R's daisy can calculate distance on categorical variables, but that certainly won't scale.

In the absense of a better idea, I loaded my data into Neo4j. My plan was to write a Cypher query that'll consolidate all the URL's into a few major groups. The cookies have a visited relationship to the URL's.

It's possible to return the top n URLs in terms of visits:

MATCH (cookie)-[r:visited]->(url)
RETURN (url), count(url)
ORDER BY count(url) DESC
LIMIT 20

Now the tricky part: iterate over each of the URLs and return the closest top-level URL (i.e. the top-level URL with the shortest path). The results would look like this:

url                                 closest_top_level_url
www.juicingforperfecthealth.com     www.webmd.com
www.ynetnews.com                    www.cnn.com
...                                 ...

Can you show me how to write this in Cypher or, alternatively, suggest a more sensible approach to reduce the dimensionality of a large list of string-pairs?

Martin Preusse · Accepted Answer

You have a bipartite graph of cookie and url, so by definition the closest url node must be 2 visited steps away.

That means you can get the closest url like this:

MATCH (cookie)-[r:visited]->(url)
WITH url, count(*) as freq
ORDER BY freq DESC
LIMIT 20
MATCH (url)<-[:visited]-(some_cookie)-[:visited]->(closest_top_level_url)
RETURN url, closest_top_level_url

'Reducing dimensionality' is another problem though and not really a question but a complex problem. It depends on what you want to achieve in the end. I guess you want to cluster the url nodes by some kind of similarity/dissimilarity measure? I.e. they should be grouped together if they are visited by the same cookies?

You could do that with a binary cookie x url matrix and different clustering methods: https://stats.stackexchange.com/questions/86318/clustering-a-binary-matrix

There are also a lot of methods for graph clustering on a bipartite graph.

Update

Technically, you can do what you ask in your comment. I would put a label on the top urls first:

MATCH (cookie)-[r:visited]->(url)
WITH url, count(*) as freq
ORDER BY freq DESC
LIMIT 20
SET url :Topurl

Then:

MATCH p = (some_url:url)-[:visited*]-(top_url:Topurl)
WITH top_url, some_url
ORDER BY length(p) desc
LIMIT 1
CREATE (some_url)-[:CLOSETO]-(top_url)

Two problems: the distance two one of the top 20 URLs can be very long if you have billions of rows. Also a single visit from one cookie to two 'distant' URLs completely changes your graph structure.

Cypher query: group to get top n, then loop over all elements and find the shortest path to a member of the top n

Answers (1)

Update

Related Questions