Reputation: 315
Imagine we have a data set like this:
import pandas as pd
import numpy as np
df = pd.DataFrame({'ID':[1,1,1,2,2,2,3,3,3,4,4,4], 'Number':[234, 43, 455, 112, 45, 234, 982, 41, 112, 46, 109, 4]})
Now, if you run the code above, you'll see that IDs 1, 2 and 3 are all 'connected', i.e. ID 1 is connected to ID 2 via Number == 234, and ID 2 is connected to ID 3 via Number == 112. But ID 4 is not connected to these other three IDs. So I would like to have a df_final like this:
df_final = pd.DataFrame({'ID':[1,1,1,2,2,2,3,3,3,4,4,4], 'Number':[234, 43, 455, 112, 45, 234, 982, 41, 112, 46, 109, 4], 'Group':[1,1,1,1,1,1,1,1,1,2,2,2]})
My data set is large, so I need code that is as efficient as possible. I defined an 'is_a_neighbor' function to check whether there is overlap between two IDs. It returns True if the two IDs are connected (i.e. have an overlap). I suppose I'll need to embed it in a loop somehow. Here's the function:
def is_a_neighbor(id_1, id_2):
    val = np.isin(df[df['ID'] == id_1][['Number']].to_numpy(),
                  df[df['ID'] == id_2][['Number']].to_numpy()).any()
    return val
and a test run
# Test is_a_neighbor function
id_1 = 1
id_2 = 2
val = is_a_neighbor(id_1, id_2)
val
Which will return 'True'. Thanks for any help/guidance on this one.
Upvotes: 1
Views: 219
Reputation: 134
As far as I can tell you have a long list of pairs (ID, Number), and two different IDs belong to the same Group if there is any number N such that both (id1, N) and (id2, N) exist.
I don't know why you think you need either pandas or numpy for this.
Why not something like:
data = [(1, 234), (1, 43), (1, 455), (2, 112), (2, 234), ...]
# Create sets of numbers for each index
number_sets = {}
for (index, number) in data:
    # if we've not seen the index before we create an empty set
    if index not in number_sets:
        number_sets[index] = set()
    number_sets[index].add(number)
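For the question's data, this loop should leave number_sets looking like the dict below (a quick sanity check, assuming data holds every (ID, Number) pair from the question's DataFrame):
# Expected contents of number_sets for the question's data
expected = {
    1: {234, 43, 455},
    2: {112, 45, 234},
    3: {982, 41, 112},
    4: {46, 109, 4},
}
assert number_sets == expected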
Now we create a list of groups where for each group we record a list of indices and the set of all numbers found with any of those indices:
groups = []
for (index, numbers) in number_sets.items():
    for (index_list, superset) in groups:
        # if any of the numbers are in the superset then this index belongs to this group
        if not superset.isdisjoint(numbers):
            index_list.append(index)
            # merge this index's numbers into the group's superset
            superset.update(numbers)
            break
    else:
        # We never had a break so we've found a new group
        groups.append(([index], numbers))
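To turn groups into a group label per ID, as in the df_final from the question, a small mapping step like the sketch below should do; group_of is a name I'm introducing here, and the 1-based numbering is just to match the question's expected output:
# Map each index to a 1-based group number
group_of = {}
for group_number, (index_list, _numbers) in enumerate(groups, start=1):
    for index in index_list:
        group_of[index] = group_number

# e.g. {1: 1, 2: 1, 3: 1, 4: 2} for the question's data
print(group_of)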
Upvotes: 0
Reputation: 444
The following code should work:
import pandas as pd
import networkx as nx
df = pd.DataFrame({'ID':[1,1,1,2,2,2,3,3,3,4,4,4],
'Number':[234, 43, 455, 112, 45, 234, 982, 41, 112, 46, 109, 4]})
df.ID = df.ID.astype(str)
# Treat each ID and each Number as a node; every row adds an edge between them
g = nx.from_pandas_edgelist(df, 'ID', 'Number')
# Each connected component of this graph is one group
connected_components = nx.connected_components(g)
# Map every node to the index of its component
d = {y: i for i, x in enumerate(connected_components) for y in x}
df['group'] = df.ID.replace(d)
df.ID = df.ID.astype(int)
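On the question's data this yields one label shared by IDs 1, 2 and 3 and a different label for ID 4. A quick way to check (the exact label values depend on the order in which networkx yields the components):
# One row per ID with its assigned group;
# expect IDs 1, 2 and 3 to share a label and ID 4 to get its own
print(df.drop_duplicates('ID')[['ID', 'group']])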
The reason why we convert the ID column to a string is that we want 1 in the ID column to be distinct from 1 in the Number column. If we didn't do that, the following example would show only one group:
ID | Number
---|-------
 1 | 3
 2 | 3
 3 | 4
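To make the collision concrete, here's a small sketch using the table above (small, g_with_ints and g_with_strs are names I'm introducing for illustration):
small = pd.DataFrame({'ID': [1, 2, 3], 'Number': [3, 3, 4]})

# Without the conversion, ID 3 and Number 3 become the same graph node,
# so everything collapses into a single component
g_with_ints = nx.from_pandas_edgelist(small, 'ID', 'Number')
print(nx.number_connected_components(g_with_ints))  # 1

# With ID converted to string, the two groups stay separate
g_with_strs = nx.from_pandas_edgelist(small.assign(ID=small.ID.astype(str)), 'ID', 'Number')
print(nx.number_connected_components(g_with_strs))  # 2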
To expand on the speed of the code, the following test resulted in an average time of 3.4 seconds:
import pandas as pd
import numpy as np
import networkx as nx
import timeit
test_df = pd.DataFrame({'ID': np.random.randint(0, 10000, 10000),
'Number': np.random.randint(0, 10000, 10000)})
def add_group(df):
    df.ID = df.ID.astype(str)
    g = nx.from_pandas_edgelist(df, 'ID', 'Number')
    connected_components = nx.connected_components(g)
    d = {y: i for i, x in enumerate(connected_components) for y in x}
    df['group'] = df.ID.replace(d)
    df.ID = df.ID.astype(int)
    return df
timeit.timeit(lambda : add_group(test_df), number=100)
Upvotes: 2