k3njiy

Reputation: 151

Author relation analysis with Python

So this is a big one:

I have a list with authors and coauthors of various publications. This list might look like this:

[[['A','uni'],[['B','uni'],['C','uni'],['D','uni'],['E','uni']]],
 [['E','uni'],[['A','uni'],['F','uni'],['G','uni']]]]

So author A worked with authors B, C, D and E on one publication, and author E worked with authors A, F and G on another publication.

What I need is a list of all authors, including those who are only listed as coauthors (B, C, D, F, G), together with whom they wrote and how many papers. Each main author (A and E) worked with their coauthors (A with B, C, D, E; E with A, F, G), but the coauthors of a paper also worked with each other (B with C, D, E and also A, and so on). On top of that, I need to know how many papers each pair worked on together.

So the end result of this small example would be:

[[['A','uni'],[['B','uni',1],['C','uni',1],['D','uni',1],['E','uni',2],['F','uni',1],['G','uni',1]]],
 [['B','uni'],[['A','uni',1],['C','uni',1],['D','uni',1],['E','uni',1]]],
 [['C','uni'],[['A','uni',1],['B','uni',1],['D','uni',1],['E','uni',1]]],
 [['D','uni'],[['A','uni',1],['B','uni',1],['C','uni',1],['E','uni',1]]],
 [['E','uni'],[['A','uni',2],['B','uni',1],['C','uni',1],['D','uni',1],['F','uni',1],['G','uni',1]]],
 [['F','uni'],[['A','uni',1],['E','uni',1],['G','uni',1]]],
 [['G','uni'],[['A','uni',1],['E','uni',1],['F','uni',1]]]]

Okay, to be honest this is a little confusing, but I hope you understand what I mean. (The 'uni' entry stands for the university the author works for. The entries might still include other information, but that should not be relevant for this task.)

I get this initial list from a Python script I wrote to parse a database. I want to create a graph displaying who wrote with whom and how often.

I have been playing around with this for some time and I just can't find a nice solution. I think I could write something that works, but it would be neither nice nor efficient, and it would be very time-consuming. So is there a quick, pythonic way of solving this problem? My example only has two publications, but I have to analyse about 10000 publications, and some of them have a few hundred coauthors...

Upvotes: 1

Views: 460

Answers (3)

kalgasnik

Reputation: 3205

My version:

from collections import Counter, defaultdict
from itertools import chain

L = [[['A', 'uni'], [['B', 'uni'], ['C', 'uni'], ['D', 'uni'], ['E', 'uni']]],
     [['E', 'uni'], [['A', 'uni'], ['F', 'uni'], ['G', 'uni']]]]

# Map each author to a Counter of everyone who shares a publication with them.
d = defaultdict(Counter)
for publication in L:
    # Main author and coauthors as hashable tuples.
    authors = [tuple(a) for a in chain([publication[0]], publication[1])]
    for author in authors:
        d[author].update(authors)

for k, v in d.items():
    # Skip the author's own entry when printing.
    print(k, [(author[0], author[1], counter)
              for author, counter in v.items() if author[0] != k[0]])

Output (Python 3.7+, where dicts keep insertion order):

('A', 'uni') [('B', 'uni', 1), ('C', 'uni', 1), ('D', 'uni', 1), ('E', 'uni', 2), ('F', 'uni', 1), ('G', 'uni', 1)]
('B', 'uni') [('A', 'uni', 1), ('C', 'uni', 1), ('D', 'uni', 1), ('E', 'uni', 1)]
('C', 'uni') [('A', 'uni', 1), ('B', 'uni', 1), ('D', 'uni', 1), ('E', 'uni', 1)]
('D', 'uni') [('A', 'uni', 1), ('B', 'uni', 1), ('C', 'uni', 1), ('E', 'uni', 1)]
('E', 'uni') [('A', 'uni', 2), ('B', 'uni', 1), ('C', 'uni', 1), ('D', 'uni', 1), ('F', 'uni', 1), ('G', 'uni', 1)]
('F', 'uni') [('E', 'uni', 1), ('A', 'uni', 1), ('G', 'uni', 1)]
('G', 'uni') [('E', 'uni', 1), ('A', 'uni', 1), ('F', 'uni', 1)]
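
If the result is needed in exactly the nested-list layout from the question, the same counting idea can be wrapped in two small functions. This is only a sketch: the names coauthor_counts and to_nested_lists are made up for illustration, and it assumes each author entry is a short list such as ['A', 'uni'] that can be turned into a tuple.

```python
from collections import Counter, defaultdict
from itertools import chain


def coauthor_counts(publications):
    """Map each author tuple to a Counter of co-authors (the author excluded)."""
    counts = defaultdict(Counter)
    for main_author, coauthors in publications:
        # A set ensures an author listed twice on one paper is counted once.
        authors = {tuple(a) for a in chain([main_author], coauthors)}
        for author in authors:
            counts[author].update(authors - {author})
    return counts


def to_nested_lists(counts):
    """Render the counts as [[author, [[coauthor fields..., n], ...]], ...]."""
    return [
        [list(author), [list(co) + [n] for co, n in sorted(partners.items())]]
        for author, partners in sorted(counts.items())
    ]


L = [[['A', 'uni'], [['B', 'uni'], ['C', 'uni'], ['D', 'uni'], ['E', 'uni']]],
     [['E', 'uni'], [['A', 'uni'], ['F', 'uni'], ['G', 'uni']]]]

result = to_nested_lists(coauthor_counts(L))
for row in result:
    print(row)
```

Sorting is only there to make the order reproducible; it can be dropped if order does not matter.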

Upvotes: 1

Michael

Reputation: 7736

You don't need a database, but first of all you need some data structure to hold and represent all your information. I won't write the full classes, just their important attributes.

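class Author(object):
    def __init__(self, name, university):
        self.name = name
        self.university = university

class Publication(object):
    def __init__(self, name, date):
        self.name = name
        self.date = date

class Authorship(object):
    def __init__(self, author, publication, main_author):
        self.author = author            # an Author
        self.publication = publication  # a Publication
        self.main_author = main_author  # bool: True for the main author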

Next, you have to organize these objects. Authors and publications should be unique, so you can put each kind in an ordinary dictionary, as long as your data set does not exceed a few hundred MB. They have to be indexed by a unique attribute. If author.name is not sufficient for that, use a tuple of university and author name, or better, the birthday or something else tied to the author if available, since universities can change.

For authorships you should create several indices, so you can search quickly without iterating over the whole list every time. For example, you might want one defaultdict(list) indexed by author and containing their publications, and another defaultdict(list) indexed by publication. Be careful to maintain consistency (duplicates and data errors can be cruel).

After that, you simply have to iterate over your dataset and fill your structure.
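
A minimal sketch of that fill step, under a few assumptions: namedtuples stand in for the classes above, the (name, university) tuple is used as the author key, and because the question's list carries no publication name or date, the position of the paper in the list serves as a stand-in publication id. The function name build_indices is made up for illustration.

```python
from collections import defaultdict, namedtuple

# Lightweight stand-ins for the classes sketched above (illustrative only).
Author = namedtuple("Author", ["name", "university"])
Authorship = namedtuple("Authorship", ["author", "publication", "main_author"])


def build_indices(publications):
    """Fill the author dictionary and the two authorship indices."""
    authors = {}                        # (name, university) -> Author
    by_author = defaultdict(list)       # author key -> [Authorship, ...]
    by_publication = defaultdict(list)  # publication id -> [Authorship, ...]
    for pub_id, (main, coauthors) in enumerate(publications):
        for position, entry in enumerate([main] + coauthors):
            key = tuple(entry[:2])
            # Reuse the existing Author object so each author stays unique.
            author = authors.setdefault(key, Author(*key))
            link = Authorship(author, pub_id, position == 0)
            by_author[key].append(link)
            by_publication[pub_id].append(link)
    return authors, by_author, by_publication


L = [[['A', 'uni'], [['B', 'uni'], ['C', 'uni'], ['D', 'uni'], ['E', 'uni']]],
     [['E', 'uni'], [['A', 'uni'], ['F', 'uni'], ['G', 'uni']]]]

authors, by_author, by_publication = build_indices(L)
print(sorted(authors))
print([link.publication for link in by_author[('E', 'uni')]])
```

With both indices in place, the coauthors of an author can be found by looking up their authorships in by_author and then the other authorships of each publication in by_publication.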

Upvotes: 1

John La Rooy

Reputation: 304167

from collections import defaultdict

L = [[['A','uni'],[['B','uni'],['C','uni'],['D','uni'],['E','uni']]],
     [['E','uni'],[['A','uni'],['F','uni'],['G','uni']]]]

# Map each author to the set of everyone they share a publication with.
res = defaultdict(set)

for x, y in L:
    # Main author plus coauthors, as hashable tuples.
    row = [tuple(x)] + [tuple(a) for a in y]
    for i in row:
        res[i] |= set(row)

for k, v in res.items():
    # Drop the author's own entry; sort so the output order is stable.
    print(k, sorted(v - {k}))

outputs:

('A', 'uni') [('B', 'uni'), ('C', 'uni'), ('D', 'uni'), ('E', 'uni'), ('F', 'uni'), ('G', 'uni')]
('B', 'uni') [('A', 'uni'), ('C', 'uni'), ('D', 'uni'), ('E', 'uni')]
('C', 'uni') [('A', 'uni'), ('B', 'uni'), ('D', 'uni'), ('E', 'uni')]
('D', 'uni') [('A', 'uni'), ('B', 'uni'), ('C', 'uni'), ('E', 'uni')]
('E', 'uni') [('A', 'uni'), ('B', 'uni'), ('C', 'uni'), ('D', 'uni'), ('F', 'uni'), ('G', 'uni')]
('F', 'uni') [('A', 'uni'), ('E', 'uni'), ('G', 'uni')]
('G', 'uni') [('A', 'uni'), ('E', 'uni'), ('F', 'uni')]
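
Note that sets record who worked with whom but not how often. Since the question mentions drawing a graph, one possible variation is to count each unordered pair of authors once per paper, which gives a weighted edge list directly. The sketch below assumes the same input layout; the function name weighted_edges is made up for illustration.

```python
from collections import Counter
from itertools import combinations


def weighted_edges(publications):
    """Count, for every unordered author pair, the papers they share."""
    edges = Counter()
    for main_author, coauthors in publications:
        # Sorting gives each pair one canonical orientation (smaller first).
        authors = sorted({tuple(a) for a in [main_author] + coauthors})
        edges.update(combinations(authors, 2))
    return edges


L = [[['A','uni'],[['B','uni'],['C','uni'],['D','uni'],['E','uni']]],
     [['E','uni'],[['A','uni'],['F','uni'],['G','uni']]]]

edges = weighted_edges(L)
for (a, b), weight in sorted(edges.items()):
    print(a, b, weight)
```

A paper with n authors contributes n*(n-1)/2 pairs, so a paper with a few hundred coauthors adds tens of thousands of edges. That is the natural size of the graph rather than overhead from the method.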

Upvotes: 1
