Hierarchical Agglomerative Clustering in Prolog

Question

I'm trying to re-familiarize myself with Prolog and I thought this could be the type of problem with an elegant solution in Prolog.

I'm following along this example:

http://home.deib.polimi.it/matteucc/Clustering/tutorial_html/hierarchical.html

I've tried a variety of data formats:

dist('BA','FI',662).
dist(0,'BA','FI',662).
dist(['BA'],['FI'],662).

but I haven't found any particular one most suitable.

Here's all the data in the first format:

%% Graph distances
dist('BA','FI',662).
dist('BA','MI',877).
dist('BA','NA',255).
dist('BA','RM',412).
dist('BA','TO',996).

dist('FI','MI',295).
dist('FI','NA',468).
dist('FI','RM',268).
dist('FI','TO',400).

dist('MI','NA',754).
dist('MI','RM',564).
dist('MI','TO',138).

dist('NA','RM',219).
dist('NA','TO',869).

dist('RM','TO',669).

Now, there seems to be some awesome structure to this problem to exploit, but I'm really struggling to get a grasp of it. I think I've got the first cluster here (thought it may not be the most elegant way of doing it ;)

minDist(A,B,D) :- dist(A,B,D), dist(X,Y,Z), A \= X, A \= Y, B \= X, B \= Y, D < Z.

min(A,B,B) :- B < A
min(A,B,A) :- A < B

dist([A,B],C, D) :- minDist(A,B,D), dist(A,C,Q), dist(B,C,W), min(Q,W,D)

The problem I have here is the concept of "replacing" the dist statements involving A and B with the cluster.

This just quickly become a brainteaser for me and I'm stuck. Any ideas on how to formulate this? Or is this perhaps just not the kind of problem elegantly solved with Prolog?

Daniel Lyons · Accepted Answer

Your table is actually perfect! The problem is that you don't have an intermediate data structure. I'm guessing you'll find the following code pretty surprising. In Prolog, you can simply use whatever structures you want, and it will actually work. First let's get the preliminary we need for calculating distance without regard for argument order:

distance(X, Y, Dist) :- dist(X, Y, Dist) ; dist(Y, X, Dist).

This just swaps the order if it doesn't get a distance on the first try.

Another utility we'll need: the list of cities:

all_cities(['BA','FI','MI','NA','RM','TO']).

This is just helpful; we could compute it, but it would be tedious and weird looking.

OK, so the end of the linked article makes it clear that what is actually being created is a tree structure. The article doesn't show you the tree at all until you get to the end, so it isn't obvious that's what's going on in the merges. In Prolog, we can simply use the structure we want and there it is, and it will work. To demonstrate, let's enumerate the items in a tree with something like member/2 for lists:

% Our clustering forms a tree. So we need to be able to do some basic
% operations on the tree, like get all of the cities in the tree. This
% predicate shows how that is done, and shows what the structure of
% the cluster is going to look like.
cluster_member(X, leaf(X)).
cluster_member(X, cluster(Left, Right)) :-
    cluster_member(X, Left) ; cluster_member(X, Right).

So you can see we're going to be making use of trees using leaf('FI') for instance, to represent a leaf-node, a cluster of N=1, and cluster(X,Y) to represent a cluster tree with two branches. The code above lets you enumerate all the cities within a cluster, which we'll need to compute the minimum distance between them.

% To calculate the minimum distance between two cluster positions we
% need to basically pair up each city from each side of the cluster
% and find the minimum.
cluster_distance(X, Y, Distance) :-
    setof(D,
          XCity^YCity^(
              cluster_member(XCity, X),
              cluster_member(YCity, Y),
              distance(XCity, YCity, D)),
          [Distance|_]).

This probably looks pretty weird. I'm cheating here. The setof/3 metapredicate finds solutions for a particular goal. The calling pattern is something like setof(Template, Goal, Result) where the Result will become a list of Template for each Goal success. This is just like bagof/3 except that setof/3 gives you unique results. How does it do that? By sorting! My third argument is [Distance|_], saying just give me the first item in the result list. Because the result is sorted, the first item in the list will be the smallest. It's a big cheat!

The XCity^YCity^ notation says to setof/3: I don't care what these variables actually are. It marks them as "existential variables." This means Prolog will not provide multiple solutions for each city combination; they will all be thrown together and sorted once.

This is all we need to perform the clustering!

From the article, the base case is when you have two clusters left: just combine them:

% OK, the base case for clustering is that we have two items left, so
% we cluster them together.
cluster([Left,Right], cluster(Left,Right)).

The inductive case takes the list of results and finds the two which are nearest and combines them. Hold on!

% The inductive case is: pair up each cluster and find the minimum distance.
cluster(CityClusters, FinalCityClusters) :-
    CityClusters = [_,_,_|_], % ensure we have >2 clusters
    setof(result(D, cluster(N1,N2), CC2),
          CC1^(select(N1, CityClusters, CC1),
               select(N2, CC1, CC2),
               cluster_distance(N1, N2, D)),
          [result(_, NewCluster, Remainder)|_]),
    cluster([NewCluster|Remainder], FinalCityClusters).

Prolog's built-in sorting is to sort a structure on the first argument. We cheat again here by creating a new structure, result/3, which will contain the distance, the cluster with that distance, and the remaining items to be considered. select/3 is extremely handy. It works by pulling an item out of the list and then giving you back the list without that item. We use it twice here to select two items from the list (I don't have to worry about comparing a place to itself as a result!). CC1 is marked as a free variable. The result structures will be created for considering each possible cluster with the items we were given. Again, setof/3 will sort the list to make it unique, so the first item in the list will happen to be the one with the shortest distance. It's a lot of work for one setof/3 call, but I like to cheat!

The last line says, take the new cluster and append it to the remaining items, and forward it on recursively to ourself. The result of that invocation will eventually be the base case.

Now does it work? Let's make a quick-n-dirty main procedure to test it:

main :-
    setof(leaf(X), (all_cities(Cities), member(X, Cities)), Basis),
    cluster(Basis, Result),
    write(Result), nl.

Line one is a cheesy way to construct the initial conditions (all cities in their own cluster of one). Line two calls our predicate to cluster things. Then we write it out. What do we get? (Output manually indented for readability.)

cluster(
  cluster(
    leaf(FI),
    cluster(
      leaf(BA),
      cluster(
        leaf(NA),
        leaf(RM)))),
  cluster(
    leaf(MI),
    leaf(TO)))

The order is slightly different, but the result is the same!

If you're perplexed by my use of setof/3 (I would be!) then consider rewriting those predicates using the aggregate library or with simple recursive procedures that aggregate and find the minimum by hand.

Hierarchical Agglomerative Clustering in Prolog

Answers (1)

Related Questions