Robert Lucas

Reputation: 81

Optimising counting occurrences in a list

I need to find the most popular pub names in a .csv file. The way I'm doing it now is to go through the list of pub names I've collected so far and check whether the current name is already there; if yes, I add one to a secondary count value, and if no, I append the name. e.g.

pub_names = [["Inn",1]]
add "Inn"
pub_names = [["Inn",2]]
add "Pub"
pub_names = [["Inn",2],["Pub",1]]

(I'll sort them by size later)
The problem is that this is incredibly slow, as I have 50,000 entries, and I was wondering if there's a way to optimise it. The second entry only has to check 1 existing name to see if it's a repeat, but the 20,000th has to check 19,999, the 20,001st has to check 20,000, and so on.

import csv
data = list(csv.reader(open("open_pubs.csv")))
iterdata = iter(data)
next(iterdata)  # skip the header row
pub_names = []
for x in iterdata:
    for i in pub_names:
        if x[1] == i[0]:
            i[1] += 1
            break  # without this, the else below appends a duplicate entry

        full_percent = (data.index(x) / len(data)) * 100
        sub_percent = (pub_names.index(i) / len(pub_names)) * 100
        print("Total =", str(full_percent) + "%", "Sub =", str(sub_percent) + "%")

    else:
        pub_names += [[x[1], 1]]

CSV file: https://www.kaggle.com/rtatman/every-pub-in-england#open_pubs.csv

Upvotes: 0

Views: 38

Answers (2)

Tim

Reputation: 3427

What you can do is load the file into a DataFrame and then do a groupby + count.

This loads all the data at once, then counts the number of occurrences of each name.

import pandas as pd

df = pd.read_csv('path_to_csv')
df2 = df.groupby('Inn Name')['Inn Name'].count()

This will be faster than any Python-level loop, since DataFrame methods are vectorized.
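As a runnable sketch of the same idea on a small in-memory sample (the real dataset's pub-name column is assumed here to be called `name`; check the actual header of open_pubs.csv):

```python
import pandas as pd

# Small in-memory stand-in for open_pubs.csv; the column name "name"
# is an assumption about the real file's header.
df = pd.DataFrame({"name": ["Inn", "Pub", "Inn", "Tavern", "Inn"]})

# Group identical names, count each group, then sort so the most
# popular names come first.
counts = df.groupby("name")["name"].count().sort_values(ascending=False)
print(counts.head())
```

`sort_values(ascending=False)` handles the "sort them by size later" step from the question in the same pass.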

Upvotes: 0

John Gordon

Reputation: 33359

Dictionaries provide much faster element access, and cleaner code in general:

pubs = {
    "Inn": 2,
    "Pub": 1
}

pubname = "Tavern"
if pubname in pubs:
    pubs[pubname] += 1
else:
    pubs[pubname] = 1
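Applied to the question's CSV loop, the whole thing becomes a single pass (a sketch; it assumes the pub name sits in column 1, as in the question's `x[1]`, and uses `collections.Counter`, which is a dict subclass that defaults missing keys to 0):

```python
import csv
from collections import Counter

def count_pub_names(path):
    """Tally pub names in one pass over the file.

    Dict/Counter lookups are O(1) on average, so 50,000 rows take
    roughly 50,000 steps instead of the quadratic number of list scans.
    """
    counts = Counter()
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for row in reader:
            counts[row[1]] += 1  # column 1 assumed to hold the name
    return counts.most_common()  # (name, count) pairs, most popular first
```

`most_common()` also takes care of the sorting-by-size step mentioned in the question.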

Upvotes: 1
