Reputation: 96360
Consider the following dataframe:
A B C
1 A1 B1 C1
2 A2 B2 C2
3 A3 B1 C1
4 A1 B1 C2
5 A2 B1 C1
6 A1 B4 C2
where A
, B
and C
represent attributes. I am hoping to infer the most likely hierarchical structure between A
, B
and C
. By this, I mean finding the ordering of {A,B,C}
that yields a hierarchy with the least number of nodes with more than one parent.
For example, let's consider one hierarchical possibility:
A->B->C
We note that it has nodes with several parents. To see this, we observe that A1
co-occurs with B1
and B4
in the combinations A1 B1 C1
and A1 B4 C1
. However, A3
also co-occurs with B1
in row 3
, with A3 B1 C1
.
In other words, only focusing on this part of the graph, if we assume the hierarchy A->B->C
, we would have a node B1
with two parents:
The question is thus, given an arbitrary dataframe like the one above, how can I find the hierarchical ordering of the columns that yields the least number of nodes with more than one parent?
There are more variants to this problem, e.g.
Solving any of these variants would be great.
Upvotes: 3
Views: 288
Reputation: 369
Here is undirected graph with your dataframe. Edge (x,y) means that there was some data line such that both x,y was mentioned.
For example - last line "A1,B4,C2" added edges (A1,B4), (B4,C2), (A1,C2)
Now it's possible to sort A,B,C according to your wishes.
Finding the hierarchy with the least number of (extra) multi-parent edges
We can bruteforce all arrangements (it's quite fast for N = 8..10) and find the cheapest (smallest, shortest) one. Edge costs in such tree (below) can be computed by graph shown above.
Mb there can be some greedy approach, smth like "Choose the cheapest on current step", I'm not sure right now, but I'm pretty sure that this representation of problem is prespective.
Upvotes: 2