Reputation: 338
After preprocessing i have a final dataframe with columns 'timestamp', 'group', 'person1', 'person2'. I am trying to figure out how to code my requirement or want to know is it possible using python. What I am trying to extract is groups within each group. for example: in group G0, A is meeting with B, B meets with C, A meets with D. It means ABCD forms a group within the group. There can be multiple groups within each group (for example in group G1). How can I do this? what logic or code can I apply to extract this? I searched a lot, but it was not of any help..
The pic of dataframe sample and expected output is:
sample data:
df = pd.DataFrame(
{
"timestamp": ['25-06-2020 09:29','25-06-2020 09:29','25-06-2020 09:31','25-06-2020 09:32','25-06-2020 09:33','25-06-2020 09:33','25-06-2020 11:17','25-06-2020 11:17','25-06-2020 11:17','25-06-2020 11:17','25-06-2020 12:29','25-06-2020 12:29','25-06-2020 12:30','25-06-2020 12:30'],
"group": ['G0','G0','G0','G0','G0','G0','G1','G1','G1','G1','G1','G2','G2','G2'],
"person1": ['A','A','B','A','X','Z','A','B','L','X','Y','L','N','O'],
"person2": ['B','B','C','D','Y','N','B','C','M','Y','Z','M','O','P']
}
)
Upvotes: 4
Views: 190
Reputation: 153460
You can do it using the networkx library graph theory and connected components:
import networkx as nx
import pandas as pd
df = pd.DataFrame(
{
"timestamp": ['25-06-2020 09:29','25-06-2020 09:29','25-06-2020 09:31','25-06-2020 09:32','25-06-2020 09:33','25-06-2020 09:33','25-06-2020 11:17','25-06-2020 11:17','25-06-2020 11:17','25-06-2020 11:17','25-06-2020 12:29','25-06-2020 12:29','25-06-2020 12:30','25-06-2020 12:30'],
"group": ['G0','G0','G0','G0','G0','G0','G1','G1','G1','G1','G1','G2','G2','G2'],
"person1": ['A','A','B','A','X','Z','A','B','L','X','Y','L','N','O'],
"person2": ['B','B','C','D','Y','N','B','C','M','Y','Z','M','O','P']
}
)
def f(x):
G = nx.from_pandas_edgelist(x, 'person1', 'person2')
l = x.apply(lambda n: ''.join(nx.node_connected_component(G, n['person1'])), axis=1)
return l
df['subgroup'] = df.groupby('group').apply(f).to_numpy()
df
Output:
timestamp group person1 person2 subgroup
0 25-06-2020 09:29 G0 A B DACB
1 25-06-2020 09:29 G0 A B DACB
2 25-06-2020 09:31 G0 B C DACB
3 25-06-2020 09:32 G0 A D DACB
4 25-06-2020 09:33 G0 X Y YX
5 25-06-2020 09:33 G0 Z N NZ
6 25-06-2020 11:17 G1 A B ACB
7 25-06-2020 11:17 G1 B C ACB
8 25-06-2020 11:17 G1 L M ML
9 25-06-2020 11:17 G1 X Y ZYX
10 25-06-2020 12:29 G1 Y Z ZYX
11 25-06-2020 12:29 G2 L M ML
12 25-06-2020 12:30 G2 N O ONP
13 25-06-2020 12:30 G2 O P ONP
Groupby subgroups:
df['timestamp'] = pd.to_datetime(df['timestamp'])
df.groupby('subgroup')['timestamp'].agg(['min', 'max'])
Output:
min max
subgroup
ACB 2020-06-25 11:17:00 2020-06-25 11:17:00
DACB 2020-06-25 09:29:00 2020-06-25 09:32:00
ML 2020-06-25 11:17:00 2020-06-25 12:29:00
NZ 2020-06-25 09:33:00 2020-06-25 09:33:00
ONP 2020-06-25 12:30:00 2020-06-25 12:30:00
YX 2020-06-25 09:33:00 2020-06-25 09:33:00
ZYX 2020-06-25 11:17:00 2020-06-25 12:29:00
Upvotes: 3
Reputation: 145
This certainly is interesting stuff, especially given the current pandemic. It sounds like you need graph theory to help you. Python can do this through dictionaries and custom classes as this tutorial describes
Additionally, this out-of-date Python documentation is probably of some help too. You would need to tweak their find_all_graphs()
function for your requirements.
Upvotes: 1