Reputation: 782
I have a pandas DataFrame containing rows of nodes that I ultimately would like to connect and turn into a graph like object. For this, I first thought of converting this DataFrame to something that resembles an adjacency list, to later on easily create a graph from this. I have the following:
A pandas Dataframe:
df = pd.DataFrame({"id": [0, 1, 2, 3, 4, 5, 6],
"start": ["A", "B", "D", "A", "X", "F", "B"],
"end": ["B", "C", "F", "G", "X", "X", "E"],
"cases": [["c1", "c2", "c44"], ["c2", "c1", "c3"], ["c4"], ["c1", ], ["c1", "c7"], ["c4"], ["c44", "c7"]]})
which looks like this:
id start end cases
0 0 A B [c1, c2, c44]
1 1 B C [c2, c1, c3]
2 2 D F [c4]
3 3 A G [c1]
4 4 X X [c1, c7]
5 5 F X [c4]
6 6 B E [c44, c7]
A function directly_follows(i, j)
that returns true if the node in row i
is followed by the node in row j
(this wil later be a directed edge in a graph from node i
to node j
):
def directly_follows(row1, row2):
return close(row1, row2) and case_overlap(row1, row2)
def close(row1, row2):
return row1["end"] == row2["start"]
def case_overlap(row1, row2):
return not set(row1["cases"]).isdisjoint(row2["cases"])
Shortly, node i
is followed by node j
if the end
value of node i
is the same as the start
value of node j
and if their cases
overlap
Based on this directly_follows
function, I want to create an extra column to my DataFrame df
which acts as an adjacency list, containing for node i
a list with the id
values of nodes that follow i
My desired result would thus be:
id start end cases adjacency_list
0 0 A B [c1, c2, c44] [1, 6]
1 1 B C [c2, c1, c3] []
2 2 D F [c4] [5]
3 3 A G [c1] []
4 4 X X [c1, c7] []
5 5 F X [c4] []
6 6 B E [c44, c7] []
Basically I thought of first creating the column adjacency_list as empty lists, and then looping through the rows of the Dataframe and if for row i
and j
directly_follows(row_i, row_j) returns True, add the id of j
to the adjacency list of i
.
I did it like this:
def connect(data):
data["adjacency_list"] = np.empty((len(data), 0)).tolist()
for i in range(len(data)):
for j in range(len(data)):
if i != j:
if directly_follows(data.iloc[i], data.iloc[j]):
data.iloc[i]["adjacency_list"] = data.iloc[i]["adjacency_list"].append(data.iloc[i]["id"])
Now first, this returns an error
SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame
And secondly, I highly doubt this is the most pythonic and efficient way to solve this problem, since my actual DataFrame consists of about 9000 rows, which would give around 81 million comparisons.
How to create the adjacency list in the least time consuming way? Is there maybe a faster or more elegant solution than mine?
Upvotes: 3
Views: 748
Reputation: 35646
A self-join option:
df['adjacency_list'] = df.apply(lambda s: df[(df['start'] == s.end) &
(df['id'] != s.id)].index.tolist(), axis=1)
print(df)
Output:
id start end cases adjacency_list
0 0 A B [c1, c2, c44] [1, 6]
1 1 B C [c2, c1, c3] []
2 2 D F [c4] [5]
3 3 A G [c1] []
4 4 X X [c1, c7] []
5 5 F X [c4] [4]
6 6 B E [c44, c7] []
Upvotes: 2
Reputation: 14516
One option would be to apply the following function - it's not completely vectorised because Dataframes don't particularly like embedding mutable objects like lists, and I don't think you can apply set operations in a vectorised way. It does cut down the number of comparisons needed though.
def f(x):
check = df[(x["end"] == df["start"])]
return [
row["id"]
for i, row in check.iterrows()
if not set(row["cases"]).isdisjoint(x["cases"])
]
df["adjacency_list"] = df.apply(f, axis=1)
Or, as a big lambda function:
df["adjacency_list"] = df.apply(
lambda x: [
row["id"]
for i, row in df[(x["end"] == df["start"])].iterrows()
if not set(row["cases"]).isdisjoint(x["cases"])
],
axis=1,
)
id start end cases adjacency_list
0 0 A B [c1, c2, c44] [1, 6]
1 1 B C [c2, c1, c3] []
2 2 D F [c4] [5]
3 3 A G [c1] []
4 4 X X [c1, c7] [4]
5 5 F X [c4] []
6 6 B E [c44, c7] []
Upvotes: 1
Reputation: 14949
TRY:
k=0
def test(x):
global k
k+=1
test_df = df[k:]
return list(test_df[test_df['start'] == x].index)
df['adjancy_matrix'] = df.end.apply(test,1)
OUTPUT:
id start end cases adjancy_matrix
0 0 A B [c1,c2,c44] [1, 6]
1 1 B C [c2,c1,c3] []
2 2 D F [c4] [5]
3 3 A G [c1] []
4 4 X X [c1,c7] []
5 5 F X [c4] []
6 6 B E [c44,c7] []
Upvotes: 1