alexhli

Reputation: 409

Python code to create graphs from columns of data

I am writing a script that produces histograms of specific columns in a tab-delimited text file. Currently, the program creates a single graph from a hard-coded column number that I am using as a placeholder.

The input table looks something like this:

 SAMPID   TRAIT   COHORT   AGE   BMI    WEIGHT   WAIST    HEIGHT  LDL     HDL 
 123      LDL     STUDY1   52    32.2   97.1     102      149     212.5   21.4 
 456      LDL     STUDY1   33    33.7   77.0     101      161     233.2   61.2 
 789      LDL     STUDY2   51    25.1   67.1     107      162     231.1   21.3 
 abc      LDL     STUDY2   76    33.1   80.4     99       134     220.5   21.2 
 ...

And I have the following code:

import csv
import numpy
from  matplotlib import pyplot

r = csv.reader(open("path",'r'), delimiter = '\t')

input_table=[]
for row in r:
   input_table.append(row)

column=[]
missing=0
nonmissing=0
for E in input_table[1:3635]:   # the number of rows in the input table
    if E[8] == "": missing+=1   # [8] is hard coded now, want to change this to column header name "LDL"
    else:
        nonmissing +=1
        column.append(float(E[8]))

pyplot.hist(column, bins=20, label="the label")   # how to handle multiple histogram outputs if multiple column headers are specified?

print "n =  ", nonmissing
print "number of missing values: ", missing
pyplot.show()

Can anyone offer suggestions that would allow me to expand/improve my program to do any of the following?

  1. graph data from columns specified by header name, not the column number

  2. iterate over a list containing multiple header names to create/display several histograms at once

  3. create a graph that only includes a subset of the data, as specified by a specific value in a column (i.e., for a specific sample ID, or a specific COHORT value)

One component not shown here is that I will eventually have a separate input file containing a list of headers (e.g. "HDL", "LDL", "HEIGHT") that need to be graphed separately, but then displayed together in a grid-like manner.

I can provide additional information if needed.

Upvotes: 1

Views: 3192

Answers (1)

Francisco

Reputation: 1382

Well, I have a few comments and suggestions; I hope they help.

In my opinion, the first thing you should do to get all those things you want is to structure your data. Try to create, for each row from the file, a dictionary like

{'SAMPID': <value_1>, 'TRAIT': <value_2>, ...}

Then you will have a list of such dict objects, and you will be able to iterate over it and filter by any field you wish.

That is the first and most important point.
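As a minimal sketch of that idea (assuming Python 3, and using a made-up in-memory sample in place of the real file), the standard library's csv.DictReader builds one dict per row keyed by the header line. The names SAMPLE and read_rows are hypothetical, chosen only for illustration:

```python
import csv
import io

# Hypothetical sample standing in for the tab-delimited input file.
SAMPLE = (
    "SAMPID\tTRAIT\tCOHORT\tAGE\tLDL\tHDL\n"
    "123\tLDL\tSTUDY1\t52\t212.5\t21.4\n"
    "456\tLDL\tSTUDY1\t33\t\t61.2\n"
    "789\tLDL\tSTUDY2\t51\t231.1\t21.3\n"
)


def read_rows(handle):
    """Return one dict per data row, keyed by the header names."""
    return list(csv.DictReader(handle, delimiter="\t"))


rows = read_rows(io.StringIO(SAMPLE))

# Columns are now addressed by header name, and filtering is a plain loop.
study1 = [row for row in rows if row["COHORT"] == "STUDY1"]
```

For a real file, the same function could be called on a handle from open(path, newline=""). All values come back as strings, so numeric columns still need a float() conversion before plotting.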

After you do that, modularize your code; do not just write a single script that does the whole job. Identify the pieces of code that would otherwise be repeated (such as a filtering loop), put each one into a function, and call it with the necessary arguments.
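One possible way to split the work into functions is sketched below, assuming Python 3 and rows already stored as a list of dicts. The function names (filter_rows, column_values, plot_histograms), the two-column grid layout, and the sample rows are all illustrative assumptions, not part of the original script:

```python
import math

# Hypothetical rows, shaped like the list of dicts described above.
rows = [
    {"SAMPID": "123", "COHORT": "STUDY1", "LDL": "212.5", "HDL": "21.4"},
    {"SAMPID": "456", "COHORT": "STUDY1", "LDL": "", "HDL": "61.2"},
    {"SAMPID": "789", "COHORT": "STUDY2", "LDL": "231.1", "HDL": "21.3"},
]


def filter_rows(rows, **criteria):
    """Keep rows whose fields equal every given value, e.g. COHORT="STUDY1"."""
    return [row for row in rows
            if all(row.get(key) == value for key, value in criteria.items())]


def column_values(rows, name):
    """Return (floats, missing_count) for one column, skipping empty cells."""
    values = []
    missing = 0
    for row in rows:
        cell = (row.get(name) or "").strip()
        if cell == "":
            missing += 1
        else:
            values.append(float(cell))
    return values, missing


def plot_histograms(rows, headers, bins=20):
    """Draw one histogram per header in a grid and return the figure."""
    # Imported here so the data helpers above can be used without matplotlib.
    from matplotlib import pyplot

    if not headers:
        raise ValueError("at least one header name is required")
    ncols = min(len(headers), 2)
    nrows = math.ceil(len(headers) / ncols)
    fig, axes = pyplot.subplots(nrows, ncols, squeeze=False)
    flat_axes = axes.ravel()
    for ax, name in zip(flat_axes, headers):
        values, missing = column_values(rows, name)
        ax.hist(values, bins=bins)
        ax.set_title("%s (n = %d, missing = %d)" % (name, len(values), missing))
    # Hide any unused cells in the grid.
    for ax in flat_axes[len(headers):]:
        ax.set_visible(False)
    fig.tight_layout()
    return fig


study1 = filter_rows(rows, COHORT="STUDY1")
ldl_values, ldl_missing = column_values(study1, "LDL")
```

A call such as plot_histograms(filter_rows(rows, COHORT="STUDY1"), ["LDL", "HDL"]) followed by pyplot.show() would then display the selected columns side by side for one cohort.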

One additional detail: you don't need to hard-code the size of your list as in

for E in input_table[1:3635]:

Just write

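for E in input_table[1:]:

That skips only the header row and works for a list of any length; an end index of -1 would also drop the last data row. Of course, if you stop treating your data as raw text, the slice won't be necessary at all. Just iterate over your list of dicts normally.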
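To check what a given slice keeps, here is a small sketch with a made-up four-row table: an open-ended slice starting at 1 skips only the header, while an end bound of -1 also leaves out the final row.

```python
# Hypothetical table: one header row followed by three data rows.
input_table = [
    ["SAMPID", "LDL"],
    ["123", "212.5"],
    ["456", "233.2"],
    ["789", "231.1"],
]

all_data_rows = input_table[1:]     # header skipped, every data row kept
without_last = input_table[1:-1]    # header skipped, last data row dropped too
```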

If you have more questions, let me know. Francisco

Upvotes: 4
