Graph algorithm to find the most likely ancestor of a node

Question

I'm working on the Wikipedia Category Graph (WCG). In the WCG, each article is associated to multiple categories. For example, the article "Lists_of_Israeli_footballers" is linked to multiple categories, such as :

Lists of association football players by nationality - Israeli footballers - Association football in Israel lists

Now, if you climb back the category tree, you are likely to find a lot of paths climbing up to the "Football" category, but there is also at least one path leading up to "Science" for example.

This is problematic because my final goal is to be able to determinate whether or not an article belongs to a given Category using the list of categories it's linked with : right now a simple ancestor search gives false positives (for example : identifies "Israeli footballers" as part of the "Science" category - which is obviously not the expected result).

I want an algorithm able to find out what the most likely ancestor is.

I thought about two main solutions :

Count the number of distinct paths in the WCG linking article's category vertices to the candidate ancestor category (and use number of paths linking to other categories of same depth for comparison)
Use some kind of clustering algorithm and make ancestor search queries in isolated graph spaces

The issue with those options is that they seem to be very costly considering the size of the WCG (2 million vertices - even more edges). Eventually, I could work with a solution that uses a preprocessing algorithm in O(n) or more to achieve O(1) later, but I need the queries to be overall very fast.

Are there existing solutions to my problem ? Open to all suggestions.

Glubus · Accepted Answer

Np, thanks for clarifying. anything like clustering is probably not a good idea, because those type of algorithms are meant to determine a category for an object that is not associated with a category yet. In your problem all objects (footballer article) is already associated to different categories.

You should probably do a complete search through all articles and save the matched categories with each article in a hash table so that you can then retrieve this category information when you need to know this for a new article.

Whether or not a category is relevant for an article seems totally arbitrary to me and seems to be something you should decide for yourself (e.g. determine a threshhold of 5 links to a category before it is determined part of the category).

If you're getting these articles from wikipedia you're probably going to have a pretty long run working through the entire tree, but in my opinion it seems like it's your only choice.

Search with DFS, and each time you find an arcticle-category match save the article in a hashtable (you need to be able to reduce an article to a unique identifier).

This is probably my most vague answer I've ever posted here, and your question might be too broad... if you're not helped with this please let me know so I can consider removing it in order to avoid confusion with future readers.

Graph algorithm to find the most likely ancestor of a node

Answers (1)

Related Questions