Reputation: 567
I am trying to use Torch for Label Propagation. I have a dataframe that looks like
ID Target Weight Label
1 12 0.4 1
2 24 0.1 0
4 13 0.5 1
4 12 0.3 1
12 1 0.1 1
12 4 0.4 1
13 4 0.2 1
17 1 0.1 0
and so on.
I built the network as follows:
G = nx.from_pandas_edgelist(df, source='ID', target='Target', edge_attr=['Weight'])
and the adjacency matrix
adj_matrix = nx.adjacency_matrix(G).toarray()
I have only two labels, 0 and 1, and some of the data is unlabelled. I created the input tensors as follows:
# Create input tensors
adj_matrix_t = torch.FloatTensor(adj_matrix)
labels_t = torch.LongTensor(df['Labels'].tolist())
When I try to run the following code:
# Learn with Label Propagation
label_propagation = LabelPropagation(adj_matrix_t)
label_propagation.fit(labels_t) # this is causing the error
I get the error: IndexError: The shape of the mask [196] at index 0 does not match the shape of the indexed tensor [207] at index 0.
I checked adj_matrix_t.shape, which is currently (207, 207), while labels_t has only 196 entries.
Do you know how I can fix this inconsistency?
Please see below the error track:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-42-cf4f88a4bb12> in <module>
2 label_propagation = LabelPropagation(adj_matrix_t)
3 print("Label Propagation: ", end="")
----> 4 label_propagation.fit(labels_t)
5 label_propagation_output_labels = label_propagation.predict_classes()
6
<ipython-input-1-54a7dbc30bd1> in fit(self, labels, max_iter, tol)
100
101 def fit(self, labels, max_iter=1000, tol=1e-3):
--> 102 super().fit(labels, max_iter, tol)
103
104 ## Label spreading
<ipython-input-1-54a7dbc30bd1> in fit(self, labels, max_iter, tol)
58 Convergence tolerance: threshold to consider the system at steady state.
59 """
---> 60 self._one_hot_encode(labels)
61
62 self.predictions = self.one_hot_labels.clone()
<ipython-input-1-54a7dbc30bd1> in _one_hot_encode(self, labels)
43 self.one_hot_labels = torch.zeros((self.n_nodes, self.n_classes), dtype=torch.float)
44 self.one_hot_labels = self.one_hot_labels.scatter(1, labels.unsqueeze(1), 1)
---> 45 self.one_hot_labels[unlabeled_mask, 0] = 0
46
47 self.labeled_mask = ~unlabeled_mask
The code below is the example I would like to adapt for label propagation. The error seems to come from the labels: some nodes in my dataset have no label (although every row in my example above has one). Could that be what triggers the error message?
Original code (for reference: https://mybinder.org/v2/gh/thibaudmartinez/label-propagation/master?filepath=notebook.ipynb):
## Testing models on synthetic data
import pandas as pd
import numpy as np
import networkx as nx
import torch
import matplotlib.pyplot as plt
# LabelPropagation and LabelSpreading are the classes defined in the linked notebook
# Create caveman graph
n_cliques = 4
size_cliques = 5
caveman_graph = nx.connected_caveman_graph(n_cliques, size_cliques)
adj_matrix = nx.adjacency_matrix(caveman_graph).toarray()
# Create labels
labels = np.full(n_cliques * size_cliques, -1.)
# Only one node per clique is labeled. Each clique belongs to a different class.
labels[0] = 0
labels[size_cliques] = 1
labels[size_cliques * 2] = 2
labels[size_cliques * 3] = 3
# Create input tensors
adj_matrix_t = torch.FloatTensor(adj_matrix)
labels_t = torch.LongTensor(labels)
# Learn with Label Propagation
label_propagation = LabelPropagation(adj_matrix_t)
print("Label Propagation: ", end="")
label_propagation.fit(labels_t)
label_propagation_output_labels = label_propagation.predict_classes()
# Learn with Label Spreading
label_spreading = LabelSpreading(adj_matrix_t)
print("Label Spreading: ", end="")
label_spreading.fit(labels_t, alpha=0.8)
label_spreading_output_labels = label_spreading.predict_classes()
# Plot graphs
color_map = {-1: "orange", 0: "blue", 1: "green", 2: "red", 3: "cyan"}
input_labels_colors = [color_map[l] for l in labels]
lprop_labels_colors = [color_map[l] for l in label_propagation_output_labels.numpy()]
lspread_labels_colors = [color_map[l] for l in label_spreading_output_labels.numpy()]
plt.figure(figsize=(14, 6))
ax1 = plt.subplot(1, 4, 1)
ax2 = plt.subplot(1, 4, 2)
ax3 = plt.subplot(1, 4, 3)
ax1.title.set_text("Raw data (4 classes)")
ax2.title.set_text("Label Propagation")
ax3.title.set_text("Label Spreading")
# Draw the caveman graph built above (G is not defined in this snippet)
pos = nx.spring_layout(caveman_graph)
nx.draw(caveman_graph, ax=ax1, pos=pos, node_color=input_labels_colors, node_size=50)
nx.draw(caveman_graph, ax=ax2, pos=pos, node_color=lprop_labels_colors, node_size=50)
nx.draw(caveman_graph, ax=ax3, pos=pos, node_color=lspread_labels_colors, node_size=50)
# Legend
ax4 = plt.subplot(1, 4, 4)
ax4.axis("off")
legend_colors = ["orange", "blue", "green", "red", "cyan"]
legend_labels = ["unlabeled", "class 0", "class 1", "class 2", "class 3"]
dummy_legend = [ax4.plot([], [], ls='-', c=c)[0] for c in legend_colors]
plt.legend(dummy_legend, legend_labels)
plt.show()
Of course, if the example dataset at the top of this post does not suit the original code because of the labels, could you give me another example so that I can understand what the labels (which determine the classes of nodes) should look like in the dataset, including missing values to be predicted? I would greatly appreciate it.
Upvotes: 2
Views: 4525
Reputation: 2252
For other readers here, it seems like this is the implementation being asked about in this question.
The method you are using to try to predict labels works with labels for nodes, not edges. To visualize this, I plotted your example data and colored the plot by your Weight and Label columns (the code that produces the plot is appended below), where Weight is the line thickness of the edge and Label is the edge color.
In order to use this method, you will need to produce data that looks like this, where each node (denoted by ID) gets exactly one node_label:
ID node_label
1 1
2 0
4 1
12 1
13 1
17 0
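As a rough sketch of how the table above could feed the model, the snippet below builds the label vector in the same node order as the adjacency matrix. It assumes that -1 marks an unlabeled node, as in the caveman example from the question; the names node_label and nodes are only illustrative. Node 24 appears as a Target but has no entry in the table, so it is treated as unlabeled here.

```python
import networkx as nx
import numpy as np
import pandas as pd

# Edge list from the question: one row per edge, not one row per node.
df = pd.DataFrame({
    "ID":     [1, 2, 4, 4, 12, 12, 13, 17],
    "Target": [12, 24, 13, 12, 1, 4, 4, 1],
    "Weight": [0.4, 0.1, 0.5, 0.3, 0.1, 0.4, 0.2, 0.1],
})
G = nx.from_pandas_edgelist(df, source="ID", target="Target", edge_attr=["Weight"])

# Node labels taken from the table above (illustrative name); node 24 is missing.
node_label = {1: 1, 2: 0, 4: 1, 12: 1, 13: 1, 17: 0}

# Fix one node order and use it for both the matrix and the labels.
nodes = list(G.nodes())
adj_matrix = nx.to_numpy_array(G, nodelist=nodes, weight="Weight")

# Assumption: -1 means "unlabeled", as in the caveman example in the question.
labels = np.array([node_label.get(n, -1) for n in nodes], dtype=np.int64)

print(len(df), G.number_of_nodes())    # rows vs. nodes: these generally differ
print(adj_matrix.shape, labels.shape)  # these must agree on the first axis
```

With the arrays aligned like this, torch.FloatTensor(adj_matrix) and torch.LongTensor(labels) should have matching first dimensions, which is the condition the IndexError in the question complains about.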
To be clear, you will still need your original data above to build the network and the adjacency matrix, but you will have to decide on some logical rule to turn your edge labels into node labels. Then, once you have predicted your unlabeled nodes, you can reverse your rule to obtain edge labels if necessary.
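One possible rule, purely as an illustration and not something the original data prescribes, is a majority vote: each node takes the most common label among the edges it touches, and ties are left unlabeled. The function name node_labels_by_majority and the -1 marker for unlabeled nodes are assumptions made for this sketch.

```python
from collections import Counter, defaultdict

# (ID, Target, Label) rows from the sample data in the question.
edges = [
    (1, 12, 1), (2, 24, 0), (4, 13, 1), (4, 12, 1),
    (12, 1, 1), (12, 4, 1), (13, 4, 1), (17, 1, 0),
]

def node_labels_by_majority(edges, unlabeled=-1):
    """Hypothetical rule: a node gets the most frequent label of its edges."""
    votes = defaultdict(Counter)
    for source, target, label in edges:
        # Every edge votes once for each of its two endpoints.
        votes[source][label] += 1
        votes[target][label] += 1

    result = {}
    for node, counter in votes.items():
        ranked = counter.most_common(2)
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            # Tie between the two most frequent labels: leave the node unlabeled.
            result[node] = unlabeled
        else:
            result[node] = ranked[0][0]
    return result

node_label = node_labels_by_majority(edges)
print(node_label)
```

Whether a vote like this makes sense depends on what the edge labels mean in your data; a different rule (for example, using only the rows where the node is the ID) may be more appropriate.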
It's not a strictly rigorous method, but it's practical and likely to yield somewhat sensible results if your data isn't just random noise.
Code appendix:
# Sample data network plot
import networkx as nx
import pandas as pd
data = {'ID': {0: 1, 1: 2, 2: 4, 3: 4, 4: 12, 5: 12, 6: 13, 7: 17},
        'Target': {0: 12, 1: 24, 2: 13, 3: 12, 4: 1, 5: 4, 6: 4, 7: 1},
        'Weight': {0: 0.4, 1: 0.1, 2: 0.5, 3: 0.3, 4: 0.1, 5: 0.4, 6: 0.2, 7: 0.1},
        'Label': {0: 1, 1: 0, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 0}}
df = pd.DataFrame.from_dict(data)
G = nx.from_pandas_edgelist(df, source='ID', target='Target', edge_attr=['Weight', 'Label'])
width = [20 * d['Weight'] for (u, v, d) in G.edges(data=True)]
edge_color = [d['Label'] for (u, v, d) in G.edges(data=True)]
nx.draw_networkx(G, width=width, edge_color=edge_color)
Upvotes: 1