Reputation: 91
Is there a standard way of doing this?
I basically have some users, who performed actions together and split off as a group. We don't know the order of the events, but can infer them:
A B C D E
WentToMall 1 1 1 0 0
DroveToTheMovies 1 0 0 0 0
AteLunchTogether 1 1 0 0 0
BoughtClothes 1 1 0 0 1
BoughtElectronics 1 1 0 0 0
The rule is they can't converge together after. So the time series would look like:
Time 0 is always all of them together, then the largest 'grouping' is splitting off into 'WentToMall' where we get A,B,C and D,E split off.
From there, it looks like AB split off from C, and AB proceed to 'AteLunchTogether, BoughtClothes, BoughtElectronics'. Sometime during 'boughtclothes', it looks like E split off from D. Finally, A and B split off at the end as A 'Drove to the movies'.
If possible, I'd like to also show this visually, maybe with nodes showing the number of events separating the split (which would look like):
ABCDE ---> ABC --> AB ->A
| | |->B
| |
| |--> C
|
|
|---> DE --> D
|-->E
A problem that comes up is sometimes you get time points which are 'difficult to asses' or appear contradictory, and don't fit in based on the minimal amount of columns. I'm not sure what to do about those either. I am given 'weights' for the actions, so I could decide based on that, or I guess generate all versions of the graph.
I was thinking maybe of recursion to do a search through, or similar?
Upvotes: 0
Views: 78
Reputation: 91
edit: the latest file is here
The process is through recursion. Pandas is useful in your scenario, though there might be a more efficient ways to do this.
We search from the further nodes. In your case, these would be A and E nodes. How we know these are the furthest nodes? Just count 0 and 1 values of all rows. Then get sum of 0 and 1 values. Also, sort values by 0. For first case, it should be like that:
0 1
DroveToTheMovies 4.0 1.0
AteLunchTogether 3.0 2.0
BoughtElectronics 3.0 2.0
WentToMall 2.0 3.0
BoughtClothes 2.0 3.0
FirstCase 0.0 5.0
This means there is 1 person drove to the movies. You see the pattern. There are people joining this person later on. In first case, there are 5 people which we began with. But there is a problem. How we know whether the previous person was in the group? Let's say X drove to the movies. Now we check for ate lunch. Say Y and Z joined the group, but not X. For this case, we will check if the latest group is in the new group. So until we reach first case, we add all the event to an array. So now we have a branch.
Assume there were people not in group case. In this case, we stored this odd behavior as well. Then, we go from there now. In first case our beginning node was A; not it is B using the same technique. So the process will be repeated again.
My final results were like that:
0 1
0 DroveToTheMovies Index(['A'], dtype='object')
1 AteLunchTogether Index(['A', 'B'], dtype='object')
2 BoughtElectronics Index(['A', 'B'], dtype='object')
3 WentToMall Index(['A', 'B', 'C'], dtype='object')
4 FirstCase Index(['A', 'B', 'C', 'D', 'E'], dtype='object')
5 BoughtClothes Index(['E'], dtype='object')
6 FirstCase Index(['D', 'E'], dtype='object')
There are two FirstCase. But you need to process these two FirstCase values and know that, this D-E group is from the first FirstCase group, then E went for bought clothes. D is unknown, therefore could be assigned as something else. And there you have it.
First branch:
ABCDE ---> ABC --> AB ->A
| |->B
|
|--> C
Second branch:
(first case)---> DE --> D
|-->E
All you have to do is now find who left branches. For first branch it is B, C, and D-E. These are easy to calculate from now on. Hope it'll help you. The code is here, and I suggest to to debug the code in order to get the whole idea more clear:
import pandas as pd
df = pd.DataFrame(
[[1, 1, 1, 0, 0],
[1, 0, 0, 0, 0],
[1, 1, 0, 0, 0],
[1, 1, 0, 0, 1],
[1, 1, 0, 0, 0]], columns=list("ABCDE"))
df.index = ['WentToMall', 'DroveToTheMovies', 'AteLunchTogether', 'BoughtClothes', 'BoughtElectronics']
first_case = pd.DataFrame(
[[1, 1, 1, 1, 1]], columns=list("ABCDE"), index=['FirstCase'])
all_case = pd.concat([first_case, df])
def case_finder(all_case):
df_case = all_case.apply(lambda x: x.value_counts(), axis=1).fillna(0)
df_case = df_case.loc[df_case[1] != 0]
return df_case.sort_values(by=1)
def check_together(x):
x = df.iloc[x]
activity = all_case.loc[x.name]
does_activity = activity.loc[activity == 1]
return activity.name, does_activity.index
def check_in(pre, now):
return pre.isin(now).all()
def check_odd(i):
act = check_together(i)[0]
who = check_together(i)[1][~check_together(i)[1].isin(check_together(i-1)[1])]
return act, who
df = case_finder(all_case)
total = all_case.shape[0]
all_acts = []
last_stable = []
while True:
for i in range(total):
act, ind = check_together(i)
if ind.size == 1:
print("Initiliazed!")
all_acts.append([act, ind])
pass
else:
p_act, p_ind = check_together(i-1)
if check_in(p_ind, ind) == True:
print("So a new person joins us!")
all_acts.append([act, ind])
else:
print("This is weird. We'll check later!")
# act, who = check_odd(i)
last_stable.append([i, p_ind])
continue
if act == 'FirstCase':
break
if len(last_stable) == 0:
print("Process done!")
break
else:
print("Update cases!")
ls_ind = last_stable[0]
all_case = all_case.drop(last_stable[0][1], axis=1)
total = all_case.shape[0]
df = case_finder(all_case)
last_stable = last_stable[1:]
print(all_acts)
x = pd.DataFrame(all_acts)
Upvotes: 1