Reputation: 61
I've just started coding and am trying to understand how NetworkX works. I have a Pandas DataFrame with one document column and several topic columns.
Each topic column indicates whether that topic is present in each document (row).
import pandas as pd
df = pd.DataFrame({'DOC': ['Doc_A', 'Doc_B', 'Doc_C', 'Doc_D', 'Doc_E'], 'topic_A': [0,0,1,0,0], 'topic_B': [1,0,0,1,0], 'topic_C': [0,1,1,1,0]})
     DOC  topic_A  topic_B  topic_C
0  Doc_A        0        1        0
1  Doc_B        0        0        1
2  Doc_C        1        0        1
3  Doc_D        0        1        1
4  Doc_E        0        0        0
What I'd like to do is create networks in which:
1) The documents are the nodes and the topics are the edges (unweighted), so two documents can be joined by multiple edges.
2) The documents are the nodes and the topics are the edges, but instead of multiple edges, each edge is weighted by how many topics the two documents share.
How can I do this? Am I even thinking correctly here?
Upvotes: 4
Views: 6088
Reputation: 57033
Here's how you can build a network in which the co-occurrences of topics in docs are represented as edges:
Start by making DOC the index and stacking the dataframe. You get a linear representation of your table:
stacked = df.set_index('DOC').stack()
#DOC
#Doc_A  topic_A    0
#       topic_B    1
#       topic_C    0
#...
You want only the rows that contain a 1, because a 1 means that a topic and a document are connected:
stacked = stacked[stacked==1]
The multi-index of this table is actually an edge list:
edges = stacked.index.tolist()
#[('Doc_A', 'topic_B'), ('Doc_B', 'topic_C'), ('Doc_C', 'topic_A'),
# ('Doc_C', 'topic_C'), ('Doc_D', 'topic_B'), ('Doc_D', 'topic_C')]
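Put together, the steps so far can be run as one snippet. This is a minimal sketch that reuses the DataFrame from the question; the names `docs_in_edges` and `missing` are my own additions, used only to show which documents end up without any edge.

```python
import pandas as pd

df = pd.DataFrame({
    'DOC': ['Doc_A', 'Doc_B', 'Doc_C', 'Doc_D', 'Doc_E'],
    'topic_A': [0, 0, 1, 0, 0],
    'topic_B': [1, 0, 0, 1, 0],
    'topic_C': [0, 1, 1, 1, 0],
})

# Long format: one row per (document, topic) pair
stacked = df.set_index('DOC').stack()
# Keep only the pairs where the topic is present in the document
stacked = stacked[stacked == 1]
# Each remaining index entry is a (document, topic) edge
edges = stacked.index.tolist()

# Documents with no topics never show up in the edge list
docs_in_edges = {doc for doc, _ in edges}
missing = [d for d in df['DOC'] if d not in docs_in_edges]
print(edges)
print(missing)
```

Note that Doc_E has no topics, so it does not appear in any edge. If it should still be a node in the graph, it has to be added separately.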
Let's make a network out of it. The new graph is bipartite: every edge joins a document to a topic. You can project it to keep the topics but discard the documents, or the other way around:
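import networkx as nx
G = nx.Graph(edges)
# Project onto the topic nodes (the documents are dropped)
Gp = nx.bipartite.projected_graph(G, df.set_index('DOC').columns)
# or project onto the document nodes instead; Doc_E has no topics,
# so it is not in G and must be added before projecting:
# G.add_nodes_from(df['DOC'])
# nx.bipartite.projected_graph(G, df['DOC'])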
Gp.edges()
#EdgeView([('topic_A', 'topic_C'), ('topic_B', 'topic_C')])
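For the two graphs the question asks about, with documents as nodes, the same bipartite graph can be projected onto the documents. The sketch below is one way to do it, not the only one: it assumes the question's DataFrame, uses `projected_graph` with `multigraph=True` for case 1 and `weighted_projected_graph` for case 2, and the variable names `docs`, `multi` and `weighted` are mine.

```python
import pandas as pd
import networkx as nx

df = pd.DataFrame({
    'DOC': ['Doc_A', 'Doc_B', 'Doc_C', 'Doc_D', 'Doc_E'],
    'topic_A': [0, 0, 1, 0, 0],
    'topic_B': [1, 0, 0, 1, 0],
    'topic_C': [0, 1, 1, 1, 0],
})

docs = df['DOC'].tolist()
stacked = df.set_index('DOC').stack()
edges = stacked[stacked == 1].index.tolist()

G = nx.Graph(edges)
# Doc_E has no topics, so add it explicitly as an isolated node;
# the projection functions expect every requested node to be in G
G.add_nodes_from(docs)

# 1) One parallel edge per shared topic; the edge key is the topic name
multi = nx.bipartite.projected_graph(G, docs, multigraph=True)
print(list(multi.edges(keys=True)))

# 2) A single edge per document pair, weighted by the number of shared topics
weighted = nx.bipartite.weighted_projected_graph(G, docs)
print(list(weighted.edges(data='weight')))
```

With this particular data no two documents share more than one topic, so the multigraph has no parallel edges and every weight is 1. The difference between the two projections only becomes visible once a pair of documents has two or more topics in common.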
Followed by a shameless piece of self-promotion.
Upvotes: 3