Reputation: 315
Imagine we have a data set like this:
import pandas as pd
import numpy as np
df = pd.DataFrame({'ID':[1,1,1,2,2,2,3,3,3,4,4,4], 'Number':[234, 43, 455, 112, 45, 234, 982, 41, 112, 46, 109, 4]})
Now, if you run the code above, you'll see that IDs 1, 2 and 3 are all 'connected', i.e. ID 1 is connected to ID 2 via Number == 234, and ID 2 is connected to ID 3 via Number == 112. But ID 4 is not connected to these other three IDs. So I would like to have a df_final like this:
df_final = pd.DataFrame({'ID':[1,1,1,2,2,2,3,3,3,4,4,4], 'Number':[234, 43, 455, 112, 45, 234, 982, 41, 112, 46, 109, 4], 'Group':[1,1,1,1,1,1,1,1,1,2,2,2]})
My data set is large, so I need code that is as efficient as possible. I defined an 'is_a_neighbor' function to check whether there is overlap between two IDs. It returns True if the two IDs are connected (i.e. have an overlap). I suppose I'll need to embed it in a loop somehow. Here's the function:
def is_a_neighbor(id_1, id_2):
    val = np.isin(df[df['ID'] == id_1][['Number']].to_numpy(),
                  df[df['ID'] == id_2][['Number']].to_numpy()).any()
    return val
and a test run
# Test is_a_neighbor function
id_1 = 1
id_2 = 2
val = is_a_neighbor(id_1, id_2)
val
Which will return 'True'. Thanks for any help/guidance on this one.
Upvotes: 1
Views: 219
Reputation: 134
As far as I can tell you have a long list of pairs (ID, Number), and two different IDs belong to the same Group if there is any number N such that both (id1, N) and (id2, N) exist.
I don't know why you think you need either pandas or numpy for this.
Why not something like:
data = [(1, 234), (1, 43), (1, 455), (2, 112), (2, 234), ...]
# Create sets of numbers for each index
number_sets = {}
for (index, number) in data:
    # if we've not seen the index before we create an empty set
    if index not in number_sets:
        number_sets[index] = set()
    number_sets[index].add(number)
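For the question's data, this loop should leave number_sets looking like the dict below (a quick sanity check, assuming data holds every (ID, Number) pair from the question's DataFrame):
# Expected contents of number_sets for the question's data
expected = {
    1: {234, 43, 455},
    2: {112, 45, 234},
    3: {982, 41, 112},
    4: {46, 109, 4},
}
assert number_sets == expected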
Now we create a list of groups where for each group we record a list of indices and the set of all numbers found with any of those indices:
groups = []
for (index, numbers) in number_sets.items():
    for (index_list, superset) in groups:
        # if any of the numbers are in the superset then this index belongs to this group
        if not superset.isdisjoint(numbers):
            index_list.append(index)
            # merge this index's numbers into the group's superset
            superset.update(numbers)
            break
    else:
        # We never had a break so we've found a new group
        groups.append(([index], numbers))
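To turn groups into a group label per ID, as in the df_final from the question, a small mapping step like the sketch below should do; group_of is a name I'm introducing here, and the 1-based numbering is just to match the question's expected output:
# Map each index to a 1-based group number
group_of = {}
for group_number, (index_list, _numbers) in enumerate(groups, start=1):
    for index in index_list:
        group_of[index] = group_number

# e.g. {1: 1, 2: 1, 3: 1, 4: 2} for the question's data
print(group_of)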
Upvotes: 0
Reputation: 444
The following code should work:
import pandas as pd
import networkx as nx
df = pd.DataFrame({'ID':[1,1,1,2,2,2,3,3,3,4,4,4],
'Number':[234, 43, 455, 112, 45, 234, 982, 41, 112, 46, 109, 4]})
df.ID = df.ID.astype(str)
# Treat each ID and each Number as a node; every row adds an edge between them
g = nx.from_pandas_edgelist(df, 'ID', 'Number')
# Each connected component of this graph is one group
connected_components = nx.connected_components(g)
# Map every node to the index of its component
d = {y: i for i, x in enumerate(connected_components) for y in x}
df['group'] = df.ID.replace(d)
df.ID = df.ID.astype(int)
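On the question's data this yields one label shared by IDs 1, 2 and 3 and a different label for ID 4. A quick way to check (the exact label values depend on the order in which networkx yields the components):
# One row per ID with its assigned group;
# expect IDs 1, 2 and 3 to share a label and ID 4 to get its own
print(df.drop_duplicates('ID')[['ID', 'group']])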
The reason why we convert the ID column to a string is that we want 1 in the ID column to be distinct from 1 in the Number column. If we didn't do that, the following example would show only one group:
ID | Number
---|-------
 1 | 3
 2 | 3
 3 | 4
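To make the collision concrete, here's a small sketch using the table above (small, g_with_ints and g_with_strs are names I'm introducing for illustration):
small = pd.DataFrame({'ID': [1, 2, 3], 'Number': [3, 3, 4]})

# Without the conversion, ID 3 and Number 3 become the same graph node,
# so everything collapses into a single component
g_with_ints = nx.from_pandas_edgelist(small, 'ID', 'Number')
print(nx.number_connected_components(g_with_ints))  # 1

# With ID converted to string, the two groups stay separate
g_with_strs = nx.from_pandas_edgelist(small.assign(ID=small.ID.astype(str)), 'ID', 'Number')
print(nx.number_connected_components(g_with_strs))  # 2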
To expand on the speed of the code, the following test resulted in an average time of 3.4 seconds:
import pandas as pd
import numpy as np
import networkx as nx
import timeit
test_df = pd.DataFrame({'ID': np.random.randint(0, 10000, 10000),
'Number': np.random.randint(0, 10000, 10000)})
def add_group(df):
    df.ID = df.ID.astype(str)
    g = nx.from_pandas_edgelist(df, 'ID', 'Number')
    connected_components = nx.connected_components(g)
    d = {y: i for i, x in enumerate(connected_components) for y in x}
    df['group'] = df.ID.replace(d)
    df.ID = df.ID.astype(int)
    return df
timeit.timeit(lambda : add_group(test_df), number=100)
Upvotes: 2