Reputation: 81
I need to find the most popular pub names in a .csv file. The way I'm doing it now is by going through the list of pub names to check if each name is already there; if yes, I add one to a secondary value, and if no, I add the name. e.g.
pub_names = [["Inn",1]]
add "Inn"
pub_names = [["Inn",2]]
add "Pub"
pub_names = [["Inn",2],["Pub",1]]
(I'll sort them by size later)
The problem is that this is incredibly slow, as I have 50,000 entries, and I was wondering if there's a way to optimise it. For the second entry it checks 1 name to see if it's a repeat, but for the 20,000th it checks 19,999, for the 20,001st it checks 20,000, and so on.
import csv

data = list(csv.reader(open("open_pubs.csv")))
iterdata = iter(data)
next(iterdata)  # skip the header row
pub_names = []

for x in iterdata:
    for i in pub_names:
        if x[1] == i[0]:
            i[1] += 1
            full_percent = (data.index(x) / len(data)) * 100
            sub_percent = (pub_names.index(i) / len(pub_names)) * 100
            print("Total =", str(full_percent) + "%", "Sub =", str(sub_percent) + "%")
            break
    else:  # no break: the name wasn't found, so add it
        pub_names += [[x[1], 1]]
CSV file: https://www.kaggle.com/rtatman/every-pub-in-england#open_pubs.csv
Upvotes: 0
Views: 38
Reputation: 3427
What you can do is load the file into a DataFrame and then do a groupby + count. This loads all the data at once, then counts the number of occurrences of each name.
import pandas as pd
df = pd.read_csv('path_to_csv')
df2 = df.groupby('Inn Name')['Inn Name'].count()
This will be faster than any Python-level loop, since DataFrame methods are vectorized.
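As a sketch of the same idea on toy data (the real column header in the Kaggle file isn't shown above, so "name" here is an assumed stand-in for 'Inn Name'):

```python
import pandas as pd

# Toy frame standing in for pd.read_csv('path_to_csv');
# "name" is an assumed column header for the pub name.
df = pd.DataFrame({"name": ["Inn", "Pub", "Inn", "Tavern", "Inn"]})

# groupby + count, as in the answer above
counts = df.groupby("name")["name"].count()

# value_counts is equivalent and comes back sorted by frequency
top = df["name"].value_counts()
```

`value_counts` saves the later "sort them by size" step, since it already returns the names in descending order of count.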
Upvotes: 0
Reputation: 33359
Dictionaries provide much faster element access, and cleaner code in general:
pubs = {
    "Inn": 2,
    "Pub": 1
}

pubname = "Tavern"

if pubname in pubs:
    pubs[pubname] += 1
else:
    pubs[pubname] = 1
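Applied to the question's task, the same idea can be sketched with `collections.Counter`, which wraps exactly this count-or-insert pattern (column index 1 for the pub name is assumed from the question's `x[1]`; a small in-memory sample stands in for the real file):

```python
import csv
import io
from collections import Counter

# Stand-in for open("open_pubs.csv"): a header row, then one pub
# per line with the name in column 1, as in the question's code.
sample = io.StringIO(
    "id,name\n"
    "1,Inn\n"
    "2,Pub\n"
    "3,Inn\n"
)

reader = csv.reader(sample)
next(reader)  # skip the header row
counts = Counter(row[1] for row in reader)

# most_common() returns (name, count) pairs sorted by count, descending
print(counts.most_common())
```

Each lookup in the `Counter` (a dict subclass) is a hash probe rather than a scan of the whole list, so the run is linear in the number of rows instead of quadratic.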
Upvotes: 1