Reputation: 3
Basically, I am new to data analysis, and I have a dataset about the Olympics that I would like to analyze and plot in order to test some hypotheses and learn more about the data.
First, I would like to find out which age wins the most gold, silver, and bronze medals, and the same for height.
This is the code I have written. I think it works (I am not sure), but it takes about 20 minutes to run, and the output format is awkward, which makes it hard to put into a graph. I would like to know how to cut the processing time significantly and how to graph the result:
#calculating number of medals each person has
j=0
i=0
height_gold=[0]*230
height_silver=[0]*230
height_bronze=[0]*230
while(i<271116):
    while(j<230):
        if df.iloc[i,4]==j:
            if df.iloc[i,14]=='Gold':
                height_gold[j]=height_gold[j]+1
            if df.iloc[i,14]=='Silver':
                height_silver[j]=height_silver[j]+1
            if df.iloc[i,14]=='Bronze':
                height_bronze[j]=height_bronze[j]+1
        j=j+1
        #print('new_age')
    i=i+1
    j=0
    #print('new_row')
print(height_gold)
print(height_silver)
print(height_bronze)
Also, I would very much like to know how to find out which sport gets the most medals, which Olympic year gave out the most medals, and which country gets the most medals.
Now that I am here, I would also like to ask what else I could find out from this CSV file:
the CSV file/data I am using to get data to plot a graph
Upvotes: 0
Views: 79
Reputation: 1180
I have some improvements that would significantly speed things up. The main one is to use df.iloc[i,4] directly as the index in the arrays instead of testing every possible value in an inner loop. With these, you get the following code:
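import pandas as pd  # only needed for pd.isna below

medals={'Gold': [0]*230,
        'Silver': [0]*230,
        'Bronze': [0]*230}
for i in range(271116):
    height = df.iloc[i,4]
    medal = df.iloc[i,14]
    # skip rows with no medal or no recorded height
    if medal not in medals or pd.isna(height):
        continue
    medals[medal][int(height)] += 1
print(medals['Gold'])
print(medals['Silver'])
print(medals['Bronze'])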
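Much of the remaining cost in a loop like this comes from calling df.iloc once per cell. As a rough sketch of one way around that, the two columns can be walked in lockstep with zip. The column names "Height" and "Medal" and the tiny made-up frame below are assumptions for illustration only; the question only gives the column positions 4 and 14.

```python
import pandas as pd

# Tiny made-up frame standing in for the real data; the column names
# "Height" and "Medal" are assumptions, not taken from the question.
df = pd.DataFrame({
    "Height": [180.0, 180.0, 175.0, None, 190.0],
    "Medal": ["Gold", None, "Silver", "Gold", "Gold"],
})

medals = {"Gold": [0] * 230, "Silver": [0] * 230, "Bronze": [0] * 230}

# Walk the two columns in lockstep instead of calling df.iloc per cell.
for height, medal in zip(df["Height"], df["Medal"]):
    if medal not in medals or pd.isna(height):
        continue  # no medal, or height not recorded
    medals[medal][int(height)] += 1

print(medals["Gold"][180], medals["Silver"][175])
```

The counting logic is unchanged; only the way each row is read differs.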
Upvotes: 1
Reputation: 4446
The problem is that you’re finding j in an inefficient way (checking all 230 possibilities). You can just set j to df.iloc[i, 4]. You’ve also built your own for loop, which I’ve fixed here as well.
height_gold = [0] * 230
height_silver = [0] * 230
height_bronze = [0] * 230
for i in range(271116):
    j = df.iloc[i, 4]
    if df.iloc[i, 14] == 'Gold':
        height_gold[j] += 1
    elif df.iloc[i, 14] == 'Silver':
        height_silver[j] += 1
    elif df.iloc[i, 14] == 'Bronze':
        height_bronze[j] += 1
print(height_gold)
print(height_silver)
print(height_bronze)
If you have non-integers, this should deal with it:
height_gold = [0] * 230
height_silver = [0] * 230
height_bronze = [0] * 230
for i in range(271116):
    j = df.iloc[i, 4]
    try:
        j = int(round(j))
    except ValueError:
        continue
    if df.iloc[i, 14] == 'Gold':
        height_gold[j] += 1
    elif df.iloc[i, 14] == 'Silver':
        height_silver[j] += 1
    elif df.iloc[i, 14] == 'Bronze':
        height_bronze[j] += 1
print(height_gold)
print(height_silver)
print(height_bronze)
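For comparison, the same counts can be produced without an explicit Python loop by letting pandas do the grouping. This is a different technique from the loops above, shown only as a minimal sketch: the column names ("Height", "Sport", "Year", "Medal") and the made-up rows are assumptions standing in for the real CSV. The same value_counts call should apply to whichever column holds the country.

```python
import pandas as pd

# Made-up rows standing in for the real CSV; all column names here
# ("Height", "Sport", "Year", "Medal") are assumptions.
df = pd.DataFrame({
    "Height": [180.0, 180.0, 175.0, None, 190.0, 175.0],
    "Sport": ["Judo", "Judo", "Rowing", "Judo", "Judo", "Golf"],
    "Year": [2000, 2004, 2004, 2004, 2000, 2000],
    "Medal": ["Gold", None, "Silver", "Gold", "Gold", "Bronze"],
})

# One row per height, one column per medal type. Rows with a missing
# height or a missing medal do not contribute to the counts.
height_by_medal = pd.crosstab(df["Height"], df["Medal"])
print(height_by_medal)

# Medals per category: keep only medal rows, then count occurrences.
winners = df.dropna(subset=["Medal"])
medals_per_sport = winners["Sport"].value_counts()
medals_per_year = winners["Year"].value_counts()
print(medals_per_sport.idxmax(), medals_per_year.idxmax())
```

Because height_by_medal is a regular DataFrame, it is also a convenient shape for graphing: with matplotlib installed, height_by_medal.plot() draws one line per medal type.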
Upvotes: 2