roushrsh
roushrsh

Reputation: 91

Best way to find Order of Time series from matrix of groups

Is there a standard way of doing this?

I basically have some users, who performed actions together and split off as a group. We don't know the order of the events, but can infer them:

                      A B C D E
 WentToMall           1 1 1 0 0
 DroveToTheMovies     1 0 0 0 0
 AteLunchTogether     1 1 0 0 0     
 BoughtClothes        1 1 0 0 1
 BoughtElectronics    1 1 0 0 0

The rule is they can't converge together after. So the time series would look like:

Time 0 is always all of them together, then the largest 'grouping' is splitting off into 'WentToMall' where we get A,B,C and D,E split off.

From there, it looks like AB split off from C, and AB proceed to 'AteLunchTogether, BoughtClothes, BoughtElectronics'. Sometime during 'boughtclothes', it looks like E split off from D. Finally, A and B split off at the end as A 'Drove to the movies'.

If possible, I'd like to also show this visually, maybe with nodes showing the number of events separating the split (which would look like):

ABCDE   ---> ABC --> AB  ->A
 |             |      |->B 
 |             |
 |             |--> C
 |
 |
 |---> DE --> D
       |-->E

A problem that comes up is sometimes you get time points which are 'difficult to asses' or appear contradictory, and don't fit in based on the minimal amount of columns. I'm not sure what to do about those either. I am given 'weights' for the actions, so I could decide based on that, or I guess generate all versions of the graph.

I was thinking maybe of recursion to do a search through, or similar?

Upvotes: 0

Views: 78

Answers (1)

Gokturk Sahin
Gokturk Sahin

Reputation: 91

edit: the latest file is here

The process is through recursion. Pandas is useful in your scenario, though there might be a more efficient ways to do this.

We search from the further nodes. In your case, these would be A and E nodes. How we know these are the furthest nodes? Just count 0 and 1 values of all rows. Then get sum of 0 and 1 values. Also, sort values by 0. For first case, it should be like that:

                    0   1
DroveToTheMovies    4.0 1.0
AteLunchTogether    3.0 2.0
BoughtElectronics   3.0 2.0
WentToMall          2.0 3.0
BoughtClothes       2.0 3.0
FirstCase           0.0 5.0

This means there is 1 person drove to the movies. You see the pattern. There are people joining this person later on. In first case, there are 5 people which we began with. But there is a problem. How we know whether the previous person was in the group? Let's say X drove to the movies. Now we check for ate lunch. Say Y and Z joined the group, but not X. For this case, we will check if the latest group is in the new group. So until we reach first case, we add all the event to an array. So now we have a branch.

Assume there were people not in group case. In this case, we stored this odd behavior as well. Then, we go from there now. In first case our beginning node was A; not it is B using the same technique. So the process will be repeated again.

My final results were like that:

    0                   1
0   DroveToTheMovies    Index(['A'], dtype='object')
1   AteLunchTogether    Index(['A', 'B'], dtype='object')
2   BoughtElectronics   Index(['A', 'B'], dtype='object')
3   WentToMall          Index(['A', 'B', 'C'], dtype='object')
4   FirstCase           Index(['A', 'B', 'C', 'D', 'E'], dtype='object')
5   BoughtClothes       Index(['E'], dtype='object')
6   FirstCase           Index(['D', 'E'], dtype='object')

There are two FirstCase. But you need to process these two FirstCase values and know that, this D-E group is from the first FirstCase group, then E went for bought clothes. D is unknown, therefore could be assigned as something else. And there you have it.

First branch:

ABCDE   ---> ABC --> AB  ->A
               |      |->B 
               |
               |--> C

Second branch:

(first case)---> DE --> D
                  |-->E

All you have to do is now find who left branches. For first branch it is B, C, and D-E. These are easy to calculate from now on. Hope it'll help you. The code is here, and I suggest to to debug the code in order to get the whole idea more clear:

import pandas as pd

df = pd.DataFrame(
   [[1, 1, 1, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0],    
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 0]], columns=list("ABCDE"))

df.index = ['WentToMall', 'DroveToTheMovies', 'AteLunchTogether', 'BoughtClothes', 'BoughtElectronics']

first_case = pd.DataFrame(
   [[1, 1, 1, 1, 1]], columns=list("ABCDE"), index=['FirstCase'])

all_case = pd.concat([first_case, df])

def case_finder(all_case):
    df_case = all_case.apply(lambda x: x.value_counts(), axis=1).fillna(0)
    df_case = df_case.loc[df_case[1] != 0]
    return df_case.sort_values(by=1)


def check_together(x):
    x = df.iloc[x]
    activity = all_case.loc[x.name]
    does_activity = activity.loc[activity == 1]
    return activity.name, does_activity.index


def check_in(pre, now):
    return pre.isin(now).all()

def check_odd(i):
    act = check_together(i)[0]
    who = check_together(i)[1][~check_together(i)[1].isin(check_together(i-1)[1])]
    return act, who

df = case_finder(all_case)
total = all_case.shape[0]


all_acts = []
last_stable = []
while True:
    for i in range(total):
        act, ind = check_together(i)
        if ind.size == 1:
            print("Initiliazed!")
            all_acts.append([act, ind])
            pass
        else:
            p_act, p_ind = check_together(i-1)
            if check_in(p_ind, ind) == True:
                print("So a new person joins us!")
                all_acts.append([act, ind])
            else:
                print("This is weird. We'll check later!")
                # act, who = check_odd(i)
                last_stable.append([i, p_ind])
                continue
        if act == 'FirstCase':
            break

    if len(last_stable) == 0:
        print("Process done!")
        break
    else:
        print("Update cases!")
        ls_ind = last_stable[0]
        all_case = all_case.drop(last_stable[0][1], axis=1)
        total = all_case.shape[0]
        df = case_finder(all_case)
        last_stable = last_stable[1:]

print(all_acts)

x = pd.DataFrame(all_acts)

Upvotes: 1

Related Questions