Reputation: 4563
I'm trying to reduce the dimensionality for a large table (billions of records) containing pairs of values. Each record consists of a cookie and a URL, e.g.
cookie url
000006A0-AC89-4D96-8091-AAA16A7990CB www.cnn.com
000006A0-AC89-4D96-8091-AAA16A7990CB www.bbc.co.uk
000006A0-AC89-4D96-8091-AAA16A7990CB www.ynetnews.com
00001490-7A5C-4944-B556-0BD5DF8A262A www.webmd.com
00001490-7A5C-4944-B556-0BD5DF8A262A www.health.com
00001490-7A5C-4944-B556-0BD5DF8A262A www.juicingforperfecthealth.com
... ...
Most of the dimensionality reduction algorithms are designed to work on numerical data. R's daisy
can calculate distance on categorical variables, but that certainly won't scale.
In the absense of a better idea, I loaded my data into Neo4j. My plan was to write a Cypher query that'll consolidate all the URL's into a few major groups. The cookies have a visited
relationship to the URL's.
It's possible to return the top n URLs in terms of visits:
MATCH (cookie)-[r:visited]->(url)
RETURN (url), count(url)
ORDER BY count(url) DESC
LIMIT 20
Now the tricky part: iterate over each of the URLs and return the closest top-level URL (i.e. the top-level URL with the shortest path). The results would look like this:
url closest_top_level_url
www.juicingforperfecthealth.com www.webmd.com
www.ynetnews.com www.cnn.com
... ...
Can you show me how to write this in Cypher or, alternatively, suggest a more sensible approach to reduce the dimensionality of a large list of string-pairs?
Upvotes: 1
Views: 366
Reputation: 9369
You have a bipartite graph of cookie
and url
, so by definition the closest url
node must be 2 visited
steps away.
That means you can get the closest url like this:
MATCH (cookie)-[r:visited]->(url)
WITH url, count(*) as freq
ORDER BY freq DESC
LIMIT 20
MATCH (url)<-[:visited]-(some_cookie)-[:visited]->(closest_top_level_url)
RETURN url, closest_top_level_url
'Reducing dimensionality' is another problem though and not really a question but a complex problem. It depends on what you want to achieve in the end. I guess you want to cluster the url
nodes by some kind of similarity/dissimilarity measure? I.e. they should be grouped together if they are visited by the same cookies?
You could do that with a binary cookie x url matrix and different clustering methods: https://stats.stackexchange.com/questions/86318/clustering-a-binary-matrix
There are also a lot of methods for graph clustering on a bipartite graph.
Technically, you can do what you ask in your comment. I would put a label on the top urls first:
MATCH (cookie)-[r:visited]->(url)
WITH url, count(*) as freq
ORDER BY freq DESC
LIMIT 20
SET url :Topurl
Then:
MATCH p = (some_url:url)-[:visited*]-(top_url:Topurl)
WITH top_url, some_url
ORDER BY length(p) desc
LIMIT 1
CREATE (some_url)-[:CLOSETO]-(top_url)
Two problems: the distance two one of the top 20 URLs can be very long if you have billions of rows. Also a single visit from one cookie to two 'distant' URLs completely changes your graph structure.
Upvotes: 2