jeffTHICC

Reputation: 3

I would like to learn data analysis, but I am having trouble with conditional statements and with extracting data to plot using matplotlib

Basically, I am new to data analysis and I have an Olympics dataset that I would like to analyze and graph, to test hypotheses and learn more about the data.

Now, I would like to find out which age wins the most gold, silver, and bronze medals, and the same for height.

This is the code I have written. I think it works (I am not sure), but it takes about 20 minutes to run, and the output format is awkward, which makes it hard to put in a graph. I would like to know how I can cut the processing time significantly and how I can graph the result:

#calculating number of medals each person has
j=0
i=0
height_gold=[0]*230
height_silver=[0]*230
height_bronze=[0]*230

while(i<271116):
    while(j<230):
        if df.iloc[i,4]==j:
            if df.iloc[i,14]=='Gold':
                height_gold[j]=height_gold[j]+1
            if df.iloc[i,14]=='Silver':
                height_silver[j]=height_silver[j]+1
            if df.iloc[i,14]=='Bronze':
                height_bronze[j]=height_bronze[j]+1
        j=j+1
        #print('new_age')
    i=i+1
    j=0
    #print('new_row')

print(height_gold)
print(height_silver)
print(height_bronze)

Also, I would like to know how to find out which sport gets the most medals, which Olympic year gave out the most medals, and which country gets the most medals.

Now that I am here, I would also like to ask what else I could find out from this CSV file:

the CSV file/data I am using to get data to plot a graph

Upvotes: 0

Views: 79

Answers (2)

EvertW

Reputation: 1180

I have some improvements that would significantly speed things up:

  • Your inner loop (the one that counts to 230) is unnecessary. You can use the value from df.iloc[i,4] directly as the index into the lists.
  • You can also use the color of the medal as the key into a dictionary of medal counts.

With these, you get the following code:

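import pandas as pd

medals={'Gold': [0]*230,
        'Silver': [0]*230,
        'Bronze': [0]*230}

for i in range(len(df)):
    height = df.iloc[i,4]
    medal = df.iloc[i,14]
    # Skip rows with no medal or no recorded height; otherwise the
    # dictionary lookup or the int() conversion would raise an error.
    if medal not in medals or pd.isna(height):
        continue
    medals[medal][int(height)] += 1

print(medals['Gold'])
print(medals['Silver'])
print(medals['Bronze'])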
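
The fixed-size lists assume every height is a whole number below 230. If that is not guaranteed, one possible variation is to count with collections.Counter instead, so no size has to be chosen up front. This is only a sketch: the rows list below is made-up sample data standing in for the (height, medal) pairs of the real DataFrame, and None marks a missing value.

```python
from collections import Counter

# Hypothetical sample of (height, medal) pairs; None marks a missing value.
rows = [(180, "Gold"), (180, "Gold"), (175, "Silver"), (None, "Gold"), (190, None)]

# One counter per medal color, keyed by height.
medals = {"Gold": Counter(), "Silver": Counter(), "Bronze": Counter()}

for height, medal in rows:
    # Ignore rows without a medal or without a recorded height.
    if medal in medals and height is not None:
        medals[medal][height] += 1

# Height with the most gold medals, and how many it has.
best_gold_height, best_gold_count = medals["Gold"].most_common(1)[0]
print(best_gold_height, best_gold_count)
```

A Counter only stores the heights that actually occur, which also makes the result easier to sort or plot than a list of 230 mostly-zero entries.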

Upvotes: 1

LeopardShark

Reputation: 4446

The problem is that you’re finding j in an inefficient way (checking all 230 possibilities for every row). You can just set j to df.iloc[i, 4]. You’ve also hand-built a for loop out of a while loop and a counter, which I’ve replaced with a real for loop here as well.

height_gold = [0] * 230
height_silver = [0] * 230
height_bronze = [0] * 230

for i in range(271116):
    j = df.iloc[i, 4]
    if df.iloc[i, 14] == 'Gold':
        height_gold[j] += 1
    elif df.iloc[i, 14] == 'Silver':
        height_silver[j] += 1
    elif df.iloc[i, 14] == 'Bronze':
        height_bronze[j] += 1

print(height_gold)
print(height_silver)
print(height_bronze)

If the column contains non-integer values, this version rounds each one to the nearest integer and skips any value that cannot be converted:

height_gold = [0] * 230
height_silver = [0] * 230
height_bronze = [0] * 230

for i in range(271116):
    j = df.iloc[i, 4]
    try:
        j = int(round(j))
    except ValueError:
        continue
    if df.iloc[i, 14] == 'Gold':
        height_gold[j] += 1
    elif df.iloc[i, 14] == 'Silver':
        height_silver[j] += 1
    elif df.iloc[i, 14] == 'Bronze':
        height_bronze[j] += 1
print(height_gold)
print(height_silver)
print(height_bronze)
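
For a further speed-up, it may be worth letting pandas do the counting instead of looping over rows in Python. The following is a minimal sketch, not a tested solution for your file: the column names "Height", "Sport" and "Medal" are assumptions (your code only uses positions 4 and 14), and the small DataFrame is made-up sample data.

```python
import pandas as pd

# Made-up sample data; the column names are assumptions about the real CSV.
df = pd.DataFrame({
    "Height": [180.0, 180.0, 175.0, None, 175.0, 180.0],
    "Sport":  ["Rowing", "Rowing", "Judo", "Judo", "Judo", "Rowing"],
    "Medal":  ["Gold", "Silver", "Gold", "Bronze", None, "Gold"],
})

# Keep only rows that have both a medal and a recorded height.
medalists = df.dropna(subset=["Medal", "Height"])

# One row per height, one column per medal type, cells hold the counts.
counts = pd.crosstab(medalists["Height"], medalists["Medal"])

# Height with the most gold medals.
top_gold_height = counts["Gold"].idxmax()

# Sport with the most medals of any kind.
top_sport = df.dropna(subset=["Medal"])["Sport"].value_counts().idxmax()

print(counts)
print(top_gold_height, top_sport)
```

The counts table is already in a shape that plots well: counts.plot(kind='bar') should draw one group of bars per height using matplotlib. The same value_counts() pattern would presumably answer the year and country questions too, given the matching column names.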

Upvotes: 2
