jim

Reputation: 315

How to check if groups of data are connected in Python

Imagine we have a data set like this:

import pandas as pd
import numpy as np

df = pd.DataFrame({'ID':[1,1,1,2,2,2,3,3,3,4,4,4], 'Number':[234, 43, 455, 112, 45, 234, 982, 41, 112, 46, 109, 4]})

If you run the code above, you'll see that IDs 1, 2 and 3 are all 'connected': ID 1 is connected to ID 2 via Number == 234, and ID 2 is connected to ID 3 via Number == 112. ID 4 is not connected to any of the other three IDs. So I would like to end up with a df_final like this:

df_final = pd.DataFrame({'ID':[1,1,1,2,2,2,3,3,3,4,4,4], 'Number':[234, 43, 455, 112, 45, 234, 982, 41, 112, 46, 109, 4], 'Group':[1,1,1,1,1,1,1,1,1,2,2,2]})

My data set is large, so I need code that is as efficient as possible. I defined an 'is_a_neighbor' function to check whether two IDs overlap. It returns 'True' if the two IDs are connected (i.e. they share at least one Number). I suppose I'll need to embed it in a loop somehow. Here's the function:

def is_a_neighbor(id_1, id_2):
  val = np.isin(df[df['ID'] == id_1][['Number']].to_numpy(), df[df['ID'] == id_2][['Number']].to_numpy()).any()
  return val

and a test run

# Test is_a_neighbor function
id_1 = 1
id_2 = 2

val = is_a_neighbor(id_1, id_2)
val

This returns 'True'. Thanks for any help/guidance on this one.

Upvotes: 1

Views: 219

Answers (2)

Roeland Rengelink

Reputation: 134

As far as I can tell, you have a long list of (ID, Number) pairs, and two different IDs belong to the same Group if there is any number N such that both (id1, N) and (id2, N) exist.

I don't know why you think you need either pandas or numpy for this.

Why not something like:

data = [(1, 234), (1, 43), (1, 455), (1, 112), (2, 234), ...]


# Create sets of numbers for each index

number_sets = {}
for (index, number) in data:
    # if we've not seen the index before we create an empty set
    if index not in number_sets:
        number_sets[index] = set()

    number_sets[index].add(number)

Now we create a list of groups, where for each group we record a list of indices and the set of numbers found with any of those indices:

groups = []
    
for (index, numbers) in number_sets.items():
    for (index_list, superset) in groups:
        # if any of the numbers are in the superset then this index belongs to this group
        if not superset.isdisjoint(numbers):
            index_list.append(index)
            # set has no join() method; update() merges the numbers in place
            superset.update(numbers)
            break
    else:
        # We never had a break so we've found a new group
        groups.append(([index], numbers))
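
One caveat with the loop above: each index joins the first group it matches, so an index that shares numbers with two existing groups does not merge them. A minimal sketch of one way around that, assuming a union-find (disjoint-set) structure is acceptable; this is a swapped-in technique, not part of the approach above, and the names `group_ids`, `find`, `union` and `first_id_for_number` are hypothetical:

```python
def group_ids(pairs):
    """Map each id to a group number (starting at 1) for (id, number) pairs."""
    parent = {}

    def find(x):
        # Register x on first sight, then walk up to its root,
        # shortening the path as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Remember the first id seen with each number; any later id with
    # the same number is merged into that id's group.
    first_id_for_number = {}
    for id_, number in pairs:
        find(id_)
        if number in first_id_for_number:
            union(first_id_for_number[number], id_)
        else:
            first_id_for_number[number] = id_

    # Number the groups in order of first appearance.
    labels = {}
    result = {}
    for id_, _ in pairs:
        root = find(id_)
        if root not in labels:
            labels[root] = len(labels) + 1
        result[id_] = labels[root]
    return result


ids = [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4]
numbers = [234, 43, 455, 112, 45, 234, 982, 41, 112, 46, 109, 4]
group_of = group_ids(list(zip(ids, numbers)))
```

With the data from the question, `group_of` puts IDs 1, 2 and 3 in one group and ID 4 in another. Each pair is handled once, so it avoids comparing every index against every group.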



Upvotes: 0

Loic RW

Reputation: 444

The following code should work:

import pandas as pd
import networkx as nx

df = pd.DataFrame({'ID':[1,1,1,2,2,2,3,3,3,4,4,4], 
                   'Number':[234, 43, 455, 112, 45, 234, 982, 41, 112, 46, 109, 4]})

df.ID = df.ID.astype(str)
g = nx.from_pandas_edgelist(df,'ID','Number')
connected_components = nx.connected_components(g)
d = {y:i for i, x in enumerate(connected_components) for y in x}

df['group'] = df.ID.replace(d)
df.ID = df.ID.astype(int)

We convert the ID column to strings so that 1 in the ID column is a distinct graph node from 1 in the Number column.

If we didn't do that, the following example would show only one group instead of two:

ID Number
1 3
2 3
3 4
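
As a quick check of that claim, here is a small sketch that builds the graph for those three rows both ways and counts the connected components. It adds edges directly with `nx.Graph` rather than going through a DataFrame, and the variable names are only for illustration:

```python
import networkx as nx

# Rows from the small example above: (ID, Number)
rows = [(1, 3), (2, 3), (3, 4)]

# Without conversion, ID 3 and Number 3 collapse into the same node.
g_raw = nx.Graph()
g_raw.add_edges_from(rows)
n_raw = len(list(nx.connected_components(g_raw)))

# With IDs as strings, '3' (an ID) and 3 (a Number) stay distinct nodes.
g_str = nx.Graph()
g_str.add_edges_from((str(i), n) for i, n in rows)
n_str = len(list(nx.connected_components(g_str)))
```

Here `n_raw` is 1, because everything is chained through the shared node 3, while `n_str` is 2: IDs 1 and 2 share Number 3, and ID 3 stands alone with Number 4.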

To expand on the speed of the code, the following test resulted in an average time of 3.4 seconds:

import pandas as pd
import numpy as np
import networkx as nx
import timeit

test_df = pd.DataFrame({'ID': np.random.randint(0, 10000, 10000), 
                        'Number': np.random.randint(0, 10000, 10000)})

def add_group(df):
    df.ID = df.ID.astype(str)
    g = nx.from_pandas_edgelist(df,'ID','Number')
    connected_components = nx.connected_components(g)
    d = {y:i for i, x in enumerate(connected_components) for y in x}

    df['group'] = df.ID.replace(d)
    df.ID = df.ID.astype(int)
    return df

timeit.timeit(lambda : add_group(test_df), number=100)
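
When reading a figure like this, keep in mind that `timeit.timeit` returns the total number of seconds for all `number` calls, so a per-call average has to be derived by dividing. A minimal sketch, using a hypothetical stand-in function `work` rather than `add_group`:

```python
import timeit


def work():
    # Hypothetical stand-in for the function being timed.
    return sum(range(1000))


runs = 100
total = timeit.timeit(work, number=runs)  # total seconds for all runs
per_call = total / runs                   # average seconds per call
```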

Upvotes: 2
