Amelio Vazquez-Reina
Amelio Vazquez-Reina

Reputation: 96360

Finding the most tree-like hierarchy that explains the data

Consider the following dataframe:

      A  B  C
1    A1 B1 C1
2    A2 B2 C2
3    A3 B1 C1
4    A1 B1 C2
5    A2 B1 C1
6    A1 B4 C2

where A, B and C represent attributes. I am hoping to infer the most likely hierarchical structure between A, B and C. By this, I mean finding the ordering of {A,B,C} that yields a hierarchy with the least number of nodes with more than one parent.

For example, let's consider one hierarchical possibility:

A->B->C

We note that it has nodes with several parents. To see this, we observe that A1 co-occurs with B1 and B4 in the combinations A1 B1 C1 and A1 B4 C1. However, A3 also co-occurs with B1 in row 3, with A3 B1 C1.

In other words, only focusing on this part of the graph, if we assume the hierarchy A->B->C, we would have a node B1 with two parents:

                                          enter image description here

The question is thus, given an arbitrary dataframe like the one above, how can I find the hierarchical ordering of the columns that yields the least number of nodes with more than one parent?

Notes:

There are more variants to this problem, e.g.

  1. Finding the hierarchy with the least number of (extra) multi-parent edges
  2. Finding the hierarchy with the least total number of edges

Solving any of these variants would be great.

Upvotes: 3

Views: 288

Answers (1)

Ralor
Ralor

Reputation: 369

Here is undirected graph with your dataframe. Edge (x,y) means that there was some data line such that both x,y was mentioned.

For example - last line "A1,B4,C2" added edges (A1,B4), (B4,C2), (A1,C2)

Now it's possible to sort A,B,C according to your wishes.

Finding the hierarchy with the least number of (extra) multi-parent edges

We can bruteforce all arrangements (it's quite fast for N = 8..10) and find the cheapest (smallest, shortest) one. Edge costs in such tree (below) can be computed by graph shown above.

Mb there can be some greedy approach, smth like "Choose the cheapest on current step", I'm not sure right now, but I'm pretty sure that this representation of problem is prespective.

Upvotes: 2

Related Questions