Reputation: 1
I am using Python and have an HTML file with a table containing the sample name, gene name, and number of cases and controls in the experiment, like this:
Sample Gene Cases,Controls
snow NGF 1,2
sun NGF 2,3
sun NGF 1,0
snow NGF 1,3
I need to split the cases and controls into two separate columns and then add columns for corrected cases and corrected controls. If the sample is snow, the number of cases has to be multiplied by 0.8, and if the sample is sun, the number of controls has to be multiplied by 1.5. I am not sure how to identify the cases and controls in each line and assign them to different variables so that I can manipulate them.
Upvotes: 0
Views: 234
Reputation: 14144
Try out the pandas library for this. Make sure to install lxml as well, since pandas uses it to parse the HTML.
First off, let's pretend this is your HTML:
<table>
<tr><th>Sample</th><th>Gene</th><th>Cases,Controls</th></tr>
<tr><td>snow</td><td>NGF</td><td>1,2</td></tr>
<tr><td>sun</td><td>NGF</td><td>2,3</td></tr>
<tr><td>sun</td><td>NGF</td><td>1,0</td></tr>
<tr><td>snow</td><td>NGF</td><td>1,3</td></tr>
</table>
I'll also assume you read that into a variable called html.
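If the table lives in a file, one way to get it into that variable might look like the sketch below. The file name is hypothetical, and the sample table is written out first only so the snippet runs on its own; with a real file you would skip the write step.

```python
from pathlib import Path
import tempfile

# Hypothetical location; point this at your real HTML file instead.
html_path = Path(tempfile.gettempdir()) / "cases_controls_example.html"

# Write the sample table so this sketch is self-contained.
html_path.write_text(
    "<table>\n"
    "<tr><th>Sample</th><th>Gene</th><th>Cases,Controls</th></tr>\n"
    "<tr><td>snow</td><td>NGF</td><td>1,2</td></tr>\n"
    "<tr><td>sun</td><td>NGF</td><td>2,3</td></tr>\n"
    "<tr><td>sun</td><td>NGF</td><td>1,0</td></tr>\n"
    "<tr><td>snow</td><td>NGF</td><td>1,3</td></tr>\n"
    "</table>\n",
    encoding="utf-8",
)

# Read the whole file into a single string.
html = html_path.read_text(encoding="utf-8")
```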
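from io import StringIO

import pandas

# StringIO: newer pandas versions expect a file-like object for literal HTML.
# thousands=None keeps "1,2" as text instead of parsing it as the number 12.
tables = pandas.read_html(StringIO(html), header=0, thousands=None)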
# Pandas reads each table read from the HTML into a list,
# we only have one here
table = tables[0]
That gives you a DataFrame containing your table, which you can now operate on, pandas style! In particular, you probably want to pull out cases and controls.
# Break out those cases and controls into a DataFrame
# Split each "cases,controls" string at the comma; the values are still strings here
case_control_list = table["Cases,Controls"].str.split(',', n=1).tolist()
case_control = pandas.DataFrame(case_control_list, columns=["Cases", "Controls"])
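The corrected columns from the question could then be added along these lines. This is only a sketch: the table is typed in by hand so it runs without the HTML step, the column names "Corrected Cases" and "Corrected Controls" are made up for illustration, and it assumes that values with no stated correction (controls for snow, cases for sun) stay unchanged.

```python
import pandas

# Stand-in for the `table` DataFrame from above, typed by hand
# so this sketch runs on its own.
table = pandas.DataFrame({
    "Sample": ["snow", "sun", "sun", "snow"],
    "Gene": ["NGF", "NGF", "NGF", "NGF"],
    "Cases,Controls": ["1,2", "2,3", "1,0", "1,3"],
})

# Split "1,2" into two columns and convert the text to numbers.
case_control = table["Cases,Controls"].str.split(",", n=1, expand=True).astype(float)
case_control.columns = ["Cases", "Controls"]

# Put the new columns next to Sample and Gene.
result = pandas.concat([table[["Sample", "Gene"]], case_control], axis=1)

# Assumption: rows not covered by a rule keep their raw value.
result["Corrected Cases"] = result["Cases"]
result["Corrected Controls"] = result["Controls"]

# snow: cases are multiplied by 0.8; sun: controls are multiplied by 1.5.
is_snow = result["Sample"] == "snow"
is_sun = result["Sample"] == "sun"
result.loc[is_snow, "Corrected Cases"] = result.loc[is_snow, "Cases"] * 0.8
result.loc[is_sun, "Corrected Controls"] = result.loc[is_sun, "Controls"] * 1.5

print(result)
```

Boolean masks with .loc keep each rule to one line and avoid looping over rows, which is usually the more natural fit in pandas.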
Upvotes: 1