Is there any way to make this code faster?

Question

I have a pandas dataframe containing the details of some papers with approximately 4 million records. I need to find the top 50 authors that have the largest number of publications within this data set. I have two files containing this data so I've to read both of them into dataframes and append them together to get a single dataframe to work with. I've only used the author column from the dataframe as there are 32 other columns that are not required.

So far, I've come up with the following solution. Also, this is an algorithms assignment so I cannot use any inbuilt algorithms. Currently, I'm using a dictionary to store the publication count of every author and then I'm looping over the dict to get the most published authors. Also, there can be multiple authors within a single row like 'Auth 1 | Auth 2 | Auth 3|' which is why I'm splitting the string.

I wanted to know if there is a faster method to accomplish all of this. Is there some way that I can find the maximum during my loop of the dataframe? Again, I'm not permitted to use inbuilt algorithms for searching or sorting. Any suggestions would help.

Thank you.

start_time = ti.default_timer()

only_authors_article = pd.DataFrame(articles['author'])
only_authors_inproceedings = pd.DataFrame(proceedings['author'])
all_authors = only_authors_article.append(only_authors_inproceedings, ignore_index = True)
all_authors = all_authors.dropna(how = 'any')

auth_dict = defaultdict(int)

for auth_list in zip(all_authors['author']):
    auth_list = auth_list[0]

    if '|' in auth_list:
        auths = auth_list.split('|')

        for auth in auths:
            auth_dict[auth] += 1
    else:
        auth_dict[auth_list] += 1


most_pub_authors = []

for i in range(0, 50):
    max_pub_count = 0
    max_pub_auth = None

    for author, pub_count in auth_dict.items(): 
        if pub_count > max_pub_count:
            max_pub_count = pub_count
            max_pub_auth = author

    most_pub_authors.append( (max_pub_auth, max_pub_count) ) 
    del auth_dict[max_pub_auth]

print(most_pub_authors) 


elapsed_time = ti.default_timer() - start_time
print("Total time taken: " + str(elapsed_time))

Edit 1: Some sample data from all_authors

    author
0   Sanjeev Saxena
1   Hans Ulrich Simon
2   Nathan Goodman|Oded Shmueli
3   Norbert Blum
4   Arnold Schönhage
5   Juha Honkala
6   Christian Lengauer|Chua-Huang Huang
7   Alain Finkel|Annie Choquet
8   Joachim Biskup
9   George Rahonis|Symeon Bozapalidis|Zoltán Fülöp...
10  Alex Kondratyev|Maciej Koutny|Victor Khomenko|...
11  Wim H. Hesselink
12  Christian Ronse
13  Carol Critchlow|Prakash Panangaden
14  Fatemeh Ghassemi|Ramtin Khosravi|Rosa Abbasi
15  Robin Milner
16  John Darlington
17  Giuseppe Serazzi|M. Italiani|Maria Calzarossa
18  Vincent Vajnovszki
19  Christian Stahl|Richard Müller 0001|Walter Vogler
20  Luc Devroye
21  K. C. Tan|T. C. Hu
22  William R. Franta
23  Ekkart Kindler
24  Demetres D. Kouvatsos
25  Christian Lengauer|Sergei Gorlatch
26  Roland Meyer
27  Stefan Reisch
28  Erzsébet Csuhaj-Varjú|Victor Mitrana
29  Lila Kari|Manasi S. Kulkarni

rustyx · Accepted Answer

The problem is with this part:

for i in range(0, 50):
    . . .
    for author, pub_count in auth_dict.items(): 
        . . .

That iterates over the entire dataset 50 times.

What you can do instead is the accumulator approach: have a list of top-50 authors, initially populate it by the first 50 authors and iterate over auth_dict once, replacing the lowest element if you found one that is higher than that.

Something like this:

top_authors = []
lowest_pub_count = 0
top_n = 50
for author, pub_count in auth_dict.items():
    if pub_count > lowest_pub_count:        # found element that is larger than the smallest in top-N so far
        if len(top_authors) < top_n:        # not reached N yet - just append to the list
            top_authors.append([author, pub_count])
            if len(top_authors) < top_n:    # keep lowest_pub_count at 0 until N is reached
                continue
        else:                               # replace the lowest element with the found one
            for i in range(len(top_authors)):
                if top_authors[i][1] == lowest_pub_count:
                    top_authors[i] = [author, pub_count]
                    break
        lowest_pub_count = pub_count
        for i in range(len(top_authors)):   # find the new lowest element
            if top_authors[i][1] < lowest_pub_count:
                lowest_pub_count = top_authors[i][1]

The sequential search for the lowest element in the top-50 list is amortized by the fact that it is done infrequently.

Is there any way to make this code faster?

Answers (2)

Related Questions