Reputation: 409
I am writing a script that produces histograms of specific columns in a tab-delimited text file. Currently, the program will create a single graph from a hard coded column number that I am using as a placeholder.
The input table looks something like this:
SAMPID TRAIT COHORT AGE BMI WEIGHT WAIST HEIGHT LDL HDL
123 LDL STUDY1 52 32.2 97.1 102 149 212.5 21.4
456 LDL STUDY1 33 33.7 77.0 101 161 233.2 61.2
789 LDL STUDY2 51 25.1 67.1 107 162 231.1 21.3
abc LDL STUDY2 76 33.1 80.4 99 134 220.5 21.2
...
And I have the following code:
import csv
import numpy
from matplotlib import pyplot
r = csv.reader(open("path",'r'), delimiter = '\t')
input_table=[]
for row in r:
input_table.append(row)
column=[]
missing=0
nonmissing=0
for E in input_table[1:3635]: # the number of rows in the input table
if E[8] == "": missing+=1 # [8] is hard coded now, want to change this to column header name "LDL"
else:
nonmissing +=1
column.append(float(E[8]))
pyplot.hist(column, bins=20, label="the label") # how to handle multiple histogram outputs if multiple column headers are specified?
print "n = ", nonmissing
print "numer of missing values: ", missing
pyplot.show()
Can anyone offer suggestions that would allow me to expand/improve my program to do any of the following?
graph data from columns specified by header name, not the column number
iterate over a list containing multiple header names to create/display several histograms at once
Create a graph that only includes a subset of the data, as specified by a specific value in a column (ie, for a specific sample ID, or a specific COHORT value)
One component not shown here is that I will eventually have a separate input file that will contain a list of headers (ie "HDL", "LDL", "HEIGHT") needing to be graphed separately, but then displayed together in a grid-like manner.
I can additional information if needed.
Upvotes: 1
Views: 3192
Reputation: 1382
Well, I have a few comments and suggestions, hope it helps.
In my opinion, the first thing you should do to get all those things you want is to structure your data. Try to create, for each row from the file, a dictionary like
{'SAMPID': <value_1>, 'TRAIL': <value_2>, ...}
And then you will have a list of such dict objects, and you will be able to iterate it and filter by any field you wish.
That is the first and most important point.
After you do that, modularize your code, do not just create a single script to get all the job done. Identify the pieces of code that will be redundant (as a filtering loop), put it into a function and call it, passing all necessary args.
One aditional detail: You don't need to hadcode the size of your list as in
for E in input_table[1:3635]:
Just write
for E in input_table[1:-1]
And it should do for every list. Of course, if you stop treating you data as raw text, that won't be necessary. Just iterate your list of dicts normally.
If you have more doubts, let me know. Francisco
Upvotes: 4