Reputation: 157
I have a CSV file with 13 columns and 303 rows. I have divided the 303 rows into healthy patients and ill patients, and I am now trying to get the average of each column for each group so I can compare them. The expected output is shown below. The CSV file contains numbers like the averages in this example, except that missing data is marked with ?.
Please enter a training file name: train.csv
Total Lines Processed: 303
Total Healthy Count: 164
Total Ill Count: 139
Averages of Healthy Patients:
[52.59, 0.56, 2.79, 129.25, 242.64, 0.14, 0.84, 158.38, 0.14, 0.59, 1.41, 0.27, 3.77, 0.00]
Averages of Ill Patients:
[56.63, 0.82, 3.59, 134.57, 251.47, 0.16, 1.17, 139.26, 0.55, 1.57, 1.83, 1.13, 5.80, 2.04]
Seperation Values are:
[54.61, 0.69, 3.19, 131.91, 247.06, 0.15, 1.00, 148.82, 0.34, 1.08, 1.62, 0.70, 4.79, 1.02]
I still have a long way to go on my code; I'm just looking for a simple way to get the averages for each group of patients. My current method only handles column 13, but I need all of the columns, as in the output above. Any help on which direction to take would be appreciated.
import csv
# turn the csv file into a list of lists
with open('train.csv') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    csv_data = list(reader)
# ill patients: diagnosis in column 13 is greater than 0
i_list = []
for row in csv_data:
    if (row and int(row[13]) > 0):
        i_list.append(int(row[13]))
# healthy patients: diagnosis in column 13 is 0
H_list = []
for row in csv_data:
    if (row and int(row[13]) <= 0):
        H_list.append(int(row[13]))
Icount = len(i_list)
IPavg = sum(i_list)/len(i_list)
Hcount = len(H_list)
HPavg = sum(H_list)/len(H_list)
with open("train.csv") as file:
    numline = len(file.readlines())
print(numline)
print("Total amount of healthy patients " + str(Hcount))
print("Total amount of ill patients " + str(Icount))
print("Averages of healthy patients " + str(HPavg))
print("Averages of ill patients " + str(IPavg))
My only idea is to repeat what I did for column 13 for every other column, but I don't know how I would keep the healthy patients separated from the ill patients.
Upvotes: 1
Views: 310
Reputation: 123453
If you want averages for each column, then it would be easiest to process all of them at once as you read the file; it's not that difficult. You didn't specify what version of Python you're using, but the following should work in both Python 2 and 3 (although it could be optimized for one or the other).
import csv
NUMCOLS = 13  # attribute columns 0-12; column 13 holds the diagnosis
with open('train.csv') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    # initialize totals
    Icount = 0
    Hcount = 0
    H_col_totals = [0.0 for _ in range(NUMCOLS)]  # init to floating pt value for Py 2
    I_col_totals = [0.0 for _ in range(NUMCOLS)]  # init to floating pt value for Py 2
    # read and process file
    for row in reader:
        if row:  # non-blank line?
            # classify the patient once per row: ill if the diagnosis is > 0
            if int(row[NUMCOLS]) > 0:
                Icount += 1
                totals = I_col_totals
            else:
                Hcount += 1
                totals = H_col_totals
            # update running total for each column
            for col in range(NUMCOLS):
                if row[col] != '?':  # assumption: a missing value adds nothing to the total
                    totals[col] += float(row[col])
# compute average of data in each column
if Hcount < 1:  # avoid dividing by zero
    HPavgs = [0.0 for _ in range(NUMCOLS)]
else:
    HPavgs = [H_col_totals[col]/Hcount for col in range(NUMCOLS)]
if Icount < 1:  # avoid dividing by zero
    IPavgs = [0.0 for _ in range(NUMCOLS)]
else:
    IPavgs = [I_col_totals[col]/Icount for col in range(NUMCOLS)]
print("Total number of healthy patients: {}".format(Hcount))
print("Total number of ill patients: {}".format(Icount))
print("Averages of healthy patients: " +
      ", ".join(format(HPavgs[col], ".2f") for col in range(NUMCOLS)))
print("Averages of ill patients: " +
      ", ".join(format(IPavgs[col], ".2f") for col in range(NUMCOLS)))
Upvotes: 2
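A possible extension of the running-totals idea from the answer above, for Python 3: the sketch below keeps a separate count per column so that ? entries are left out of that column's average, and it also computes the separation values. Two things here are assumptions rather than anything stated in the question: that missing values should simply be skipped, and that each separation value is the midpoint of the healthy and ill averages (the sample output suggests this, e.g. (52.59 + 56.63) / 2 is about 54.61). The function name column_averages and the inline sample data are made up for illustration.

```python
import csv
import io

def column_averages(rows, label_col):
    """Return (healthy_avgs, ill_avgs, separation) for the columns before label_col."""
    # sums[group][col] and counts[group][col]; group 0 = healthy, group 1 = ill
    sums = [[0.0] * label_col for _ in range(2)]
    counts = [[0] * label_col for _ in range(2)]
    for row in rows:
        if not row:  # skip blank lines
            continue
        group = 1 if float(row[label_col]) > 0 else 0
        for col in range(label_col):
            if row[col].strip() == '?':  # assumption: leave missing values out
                continue
            sums[group][col] += float(row[col])
            counts[group][col] += 1
    # average each column, guarding against columns with no usable values
    avgs = [[s / c if c else 0.0 for s, c in zip(sums[g], counts[g])]
            for g in range(2)]
    # assumption: separation value = midpoint of the two group averages
    separation = [(h + i) / 2 for h, i in zip(avgs[0], avgs[1])]
    return avgs[0], avgs[1], separation

# Made-up sample: two attribute columns followed by the label column.
sample = io.StringIO("50,1,0\n60,?,0\n70,3,2\n80,5,1\n")
healthy, ill, separation = column_averages(csv.reader(sample), label_col=2)
print(healthy)
print(ill)
print(separation)
```

For the real file you would pass csv.reader(open('train.csv')) and label_col=13 instead of the sample.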
Reputation: 4070
Why don't you use the pandas module?
It would be a lot easier to accomplish what you want.
In [42]: import pandas as pd
In [43]: import numpy as np
In [44]: df = pd.DataFrame(np.random.randn(10, 4))
In [45]: df
Out[45]:
0 1 2 3
0 1.290657 -0.376132 -0.482188 1.117486
1 -0.620332 -0.247143 0.214548 -0.975472
2 1.803212 -0.073028 0.224965 0.069488
3 -0.249340 0.491075 0.083451 0.282813
4 -0.477317 0.059482 0.867047 -0.656830
5 0.117523 0.089099 -0.561758 0.459426
6 -0.173780 -0.066054 -0.943881 -0.301504
7 1.250235 -0.949350 -1.119425 1.054016
8 1.031764 -1.470245 -0.976696 0.579424
9 0.300025 1.141415 1.503518 1.418005
In [46]: df.mean()
Out[46]:
0 0.427265
1 -0.140088
2 -0.119042
3 0.304685
dtype: float64
In your case you can try:
In [47]: df = pd.read_csv('yourfile.csv')
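For the healthy/ill split specifically, something along these lines might work. It is only a sketch: it assumes the file has no header row, that missing values are written as ?, and that the last column holds the diagnosis (0 for healthy, greater than 0 for ill), as in the question's code. The inline sample data and the names status, averages and separation are made up for illustration, and the separation line assumes those values are the midpoint of the two group averages, which is what the numbers in the question suggest.

```python
import io
import pandas as pd

# Made-up sample standing in for train.csv: two attribute columns, then the label.
sample = io.StringIO("50,1,0\n60,?,0\n70,3,2\n80,5,1\n")

# header=None: no header row; na_values='?' turns each '?' into NaN
df = pd.read_csv(sample, header=None, na_values='?')

label_col = df.columns[-1]  # last column holds the diagnosis
# label every row as 'healthy' or 'ill' based on the diagnosis
status = (df[label_col] > 0).map({False: 'healthy', True: 'ill'})

# mean() skips NaN by default, so missing values are left out of each average
averages = df.drop(columns=label_col).groupby(status).mean()

# assumption: separation value = midpoint of the two group averages
separation = (averages.loc['healthy'] + averages.loc['ill']) / 2

print(status.value_counts())
print(averages.round(2))
print(separation.round(2))
```

With the real data, replace sample with 'train.csv' in the read_csv call.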
Upvotes: 1