Get network edges from SQL tables for networkX in python

Question

I am trying to get the number of network edges from a normalised SQLite database which has been normalised as follows:

 Authors                    Paper                      Paper_Authors
 | authorID | name | etc    | paperID | title | etc    | paperID  | authorID |
 |    1     | .... | ...    |    1    | ..... | ...    |    1     |    1     |
 |    2     | .... | ...    |    2    | ..... | ...    |    1     |    2     |
 |    3     | .... | ...    |    .    | ..... | ...    |    1     |    3     | 
 |    4     | .... | ...    |  60,000 | ..... | ...    |    2     |    1     |
 |    5     | .... | ...                               |    2     |    4     |
 |    .     | .... | ...                               |    2     |    5     |
 | 120,000  | .... | ...                               |    .     |    .     |
                                                       |  60,000  | 120,000  |

With somewhere in the region of 120,000 authors and 60,000 papers, and the index table has around 250,000 rows.

I am trying to get this into networkX to do some connectivity analysis, inputting the nodes is simple:

 conn = sqlite3.connect('../input/database.sqlite')
 c = conn.cursor()
 g = nx.Graph()
 c.execute('SELECT authorID FROM Authors;')
 authors = c.fetchall()
 g.add_nodes_from(authors)

The problem I am having arises from trying to determine the edges to feed to networkX, which requires the values in a tuple of the two nodes to connect, using the data above as an example;

 [(1,1),(1,2),(1,3),(2,3),(1,4),(1,5),(4,5)]

Would describe the dataset above.

I have the following code, which works, but is inelegant:

 def coauthors(pID):
     c.execute('SELECT authorID \
                FROM Paper_Authors \
                WHERE paperID IS ?;', (pID,))
     out = c.fetchall()
     g.add_edges_from(itertools.product(out, out))

 c.execute('SELECT COUNT() FROM Papers;')
 papers = c.fetchall()

 for i in range(1, papers[0][0]+1):
     if i % 1000 == 0:
         print('On record:', str(i))
     coauthors(i)

This works by looping through each of the papers in the database, returning a list of authors and iteratively making list of author combination tuples and adding them to the network in a piecemeal way, which works, but took 30 - 45 minutes:

 print(nx.info(g))
 Name: 
 Type: Graph
 Number of nodes: 120670
 Number of edges: 697389
 Average degree:  11.5586

So my question is, is there a more elegant way to come to the same result, ideally with the paperID as the edge label, to make it easier to navigate the the network outside of networkX.

CL. · Accepted Answer

You can get all combinations of authors for each paper with a self join:

SELECT paperID,
       a1.authorID AS author1,
       a2.authorID AS author2
FROM Paper_Authors AS a1
JOIN Paper_Authors AS a2 USING (paperID)
WHERE a1.authorID < a2.authorID;         -- prevent duplicate edges

This will be horribly inefficient unless you have an index on paperID, or better, a covering index on both paperID and authorID, or better, a WITHOUT ROWID table.

Get network edges from SQL tables for networkX in python

Answers (1)

Related Questions