Reputation: 429
I have a pandas dataframe containing the details of some papers with approximately 4 million records. I need to find the top 50 authors that have the largest number of publications within this data set. I have two files containing this data so I've to read both of them into dataframes and append them together to get a single dataframe to work with. I've only used the author column from the dataframe as there are 32 other columns that are not required.
So far, I've come up with the following solution. Also, this is an algorithms assignment so I cannot use any inbuilt algorithms. Currently, I'm using a dictionary to store the publication count of every author and then I'm looping over the dict to get the most published authors. Also, there can be multiple authors within a single row like 'Auth 1 | Auth 2 | Auth 3|' which is why I'm splitting the string.
I wanted to know if there is a faster method to accomplish all of this. Is there some way that I can find the maximum during my loop of the dataframe? Again, I'm not permitted to use inbuilt algorithms for searching or sorting. Any suggestions would help.
Thank you.
start_time = ti.default_timer()
only_authors_article = pd.DataFrame(articles['author'])
only_authors_inproceedings = pd.DataFrame(proceedings['author'])
all_authors = only_authors_article.append(only_authors_inproceedings, ignore_index = True)
all_authors = all_authors.dropna(how = 'any')
auth_dict = defaultdict(int)
for auth_list in zip(all_authors['author']):
auth_list = auth_list[0]
if '|' in auth_list:
auths = auth_list.split('|')
for auth in auths:
auth_dict[auth] += 1
else:
auth_dict[auth_list] += 1
most_pub_authors = []
for i in range(0, 50):
max_pub_count = 0
max_pub_auth = None
for author, pub_count in auth_dict.items():
if pub_count > max_pub_count:
max_pub_count = pub_count
max_pub_auth = author
most_pub_authors.append( (max_pub_auth, max_pub_count) )
del auth_dict[max_pub_auth]
print(most_pub_authors)
elapsed_time = ti.default_timer() - start_time
print("Total time taken: " + str(elapsed_time))
Edit 1: Some sample data from all_authors
author
0 Sanjeev Saxena
1 Hans Ulrich Simon
2 Nathan Goodman|Oded Shmueli
3 Norbert Blum
4 Arnold Schönhage
5 Juha Honkala
6 Christian Lengauer|Chua-Huang Huang
7 Alain Finkel|Annie Choquet
8 Joachim Biskup
9 George Rahonis|Symeon Bozapalidis|Zoltán Fülöp...
10 Alex Kondratyev|Maciej Koutny|Victor Khomenko|...
11 Wim H. Hesselink
12 Christian Ronse
13 Carol Critchlow|Prakash Panangaden
14 Fatemeh Ghassemi|Ramtin Khosravi|Rosa Abbasi
15 Robin Milner
16 John Darlington
17 Giuseppe Serazzi|M. Italiani|Maria Calzarossa
18 Vincent Vajnovszki
19 Christian Stahl|Richard Müller 0001|Walter Vogler
20 Luc Devroye
21 K. C. Tan|T. C. Hu
22 William R. Franta
23 Ekkart Kindler
24 Demetres D. Kouvatsos
25 Christian Lengauer|Sergei Gorlatch
26 Roland Meyer
27 Stefan Reisch
28 Erzsébet Csuhaj-Varjú|Victor Mitrana
29 Lila Kari|Manasi S. Kulkarni
Upvotes: 1
Views: 124
Reputation: 85530
The problem is with this part:
for i in range(0, 50):
. . .
for author, pub_count in auth_dict.items():
. . .
That iterates over the entire dataset 50 times.
What you can do instead is the accumulator approach: have a list of top-50 authors, initially populate it by the first 50 authors and iterate over auth_dict
once, replacing the lowest element if you found one that is higher than that.
Something like this:
top_authors = []
lowest_pub_count = 0
top_n = 50
for author, pub_count in auth_dict.items():
if pub_count > lowest_pub_count: # found element that is larger than the smallest in top-N so far
if len(top_authors) < top_n: # not reached N yet - just append to the list
top_authors.append([author, pub_count])
if len(top_authors) < top_n: # keep lowest_pub_count at 0 until N is reached
continue
else: # replace the lowest element with the found one
for i in range(len(top_authors)):
if top_authors[i][1] == lowest_pub_count:
top_authors[i] = [author, pub_count]
break
lowest_pub_count = pub_count
for i in range(len(top_authors)): # find the new lowest element
if top_authors[i][1] < lowest_pub_count:
lowest_pub_count = top_authors[i][1]
The sequential search for the lowest element in the top-50 list is amortized by the fact that it is done infrequently.
Upvotes: 1
Reputation: 225273
auth_dict = defaultdict(int)
for auth_list in zip(all_authors['author']):
auth_list = auth_list[0]
if '|' in auth_list:
auths = auth_list.split('|')
for auth in auths:
auth_dict[auth] += 1
else:
auth_dict[auth_list] += 1
is a complicated way to write
auth_dict = defaultdict(int)
for auth_list in all_authors['author']:
for auth in auth_list.split('|'):
auth_dict[auth] += 1
and this might be faster:
Counter(itertools.chain.from_iterable(
auth_list.split('|') for auth_list in all_authors['author']))
where itertools
is import itertools
and Counter
is from collections import Counter
.
most_pub_authors = []
for i in range(0, 50):
max_pub_count = 0
max_pub_auth = None
for author, pub_count in auth_dict.items():
if pub_count > max_pub_count:
max_pub_count = pub_count
max_pub_auth = author
most_pub_authors.append( (max_pub_auth, max_pub_count) )
del auth_dict[max_pub_auth]
print(most_pub_authors)
goes through the whole dict a fair number of times. Try one pass:
most_pub_authors = heapq.nlargest(50, auth_dict.items(), key=itemgetter(1))
where itemgetter
is from operator import itemgetter
.
Upvotes: 1