Reputation: 168
I am trying to get the number of network edges from a normalised SQLite database which has been normalised as follows:
Authors Paper Paper_Authors
| authorID | name | etc | paperID | title | etc | paperID | authorID |
| 1 | .... | ... | 1 | ..... | ... | 1 | 1 |
| 2 | .... | ... | 2 | ..... | ... | 1 | 2 |
| 3 | .... | ... | . | ..... | ... | 1 | 3 |
| 4 | .... | ... | 60,000 | ..... | ... | 2 | 1 |
| 5 | .... | ... | 2 | 4 |
| . | .... | ... | 2 | 5 |
| 120,000 | .... | ... | . | . |
| 60,000 | 120,000 |
With somewhere in the region of 120,000 authors and 60,000 papers, and the index table has around 250,000 rows.
I am trying to get this into networkX to do some connectivity analysis, inputting the nodes is simple:
conn = sqlite3.connect('../input/database.sqlite')
c = conn.cursor()
g = nx.Graph()
c.execute('SELECT authorID FROM Authors;')
authors = c.fetchall()
g.add_nodes_from(authors)
The problem I am having arises from trying to determine the edges to feed to networkX, which requires the values in a tuple of the two nodes to connect, using the data above as an example;
[(1,1),(1,2),(1,3),(2,3),(1,4),(1,5),(4,5)]
Would describe the dataset above.
I have the following code, which works, but is inelegant:
def coauthors(pID):
c.execute('SELECT authorID \
FROM Paper_Authors \
WHERE paperID IS ?;', (pID,))
out = c.fetchall()
g.add_edges_from(itertools.product(out, out))
c.execute('SELECT COUNT() FROM Papers;')
papers = c.fetchall()
for i in range(1, papers[0][0]+1):
if i % 1000 == 0:
print('On record:', str(i))
coauthors(i)
This works by looping through each of the papers in the database, returning a list of authors and iteratively making list of author combination tuples and adding them to the network in a piecemeal way, which works, but took 30 - 45 minutes:
print(nx.info(g))
Name:
Type: Graph
Number of nodes: 120670
Number of edges: 697389
Average degree: 11.5586
So my question is, is there a more elegant way to come to the same result, ideally with the paperID as the edge label, to make it easier to navigate the the network outside of networkX.
Upvotes: 3
Views: 2784
Reputation: 180200
You can get all combinations of authors for each paper with a self join:
SELECT paperID,
a1.authorID AS author1,
a2.authorID AS author2
FROM Paper_Authors AS a1
JOIN Paper_Authors AS a2 USING (paperID)
WHERE a1.authorID < a2.authorID; -- prevent duplicate edges
This will be horribly inefficient unless you have an index on paperID
, or better, a covering index on both paperID
and authorID
, or better, a WITHOUT ROWID table.
Upvotes: 2