user2773872

Reputation: 1

Manipulating Data in an HTML File

I am using Python, and I have an HTML file with a table containing the sample name, gene name, and the number of cases and controls in the experiment. It looks like this:

Sample    Gene     Cases,Controls
snow      NGF       1,2
sun       NGF       2,3
sun       NGF       1,0
snow      NGF       1,3

I need to split the cases and controls into two separate columns, and then add columns for corrected cases and corrected controls. If the sample is snow, the number of cases has to be multiplied by 0.8, and if the sample is sun, the number of controls has to be multiplied by 1.5. I am not sure how to identify the cases and controls in each line and assign them to different variables so that I can manipulate them.

Upvotes: 0

Views: 234

Answers (1)

Kyle Kelley

Reputation: 14144

Try out the pandas library for this. Make sure to install lxml as well.

First off, let's pretend this is your HTML:

<table>
<tr><th>Sample</th><th>Gene</th><th>Cases,Controls</th></tr>
<tr><td>snow</td><td>NGF</td><td>1,2</td></tr>
<tr><td>sun</td><td>NGF</td><td>2,3</td></tr>
<tr><td>sun</td><td>NGF</td><td>1,0</td></tr>
<tr><td>snow</td><td>NGF</td><td>1,3</td></tr>
</table>

I'll also assume you read that into a variable called html.
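If the table lives in a file on disk, reading it into that variable could look like the sketch below. The helper name `load_html` and the UTF-8 encoding are assumptions for illustration, not something stated in the question.

```python
from pathlib import Path


def load_html(path):
    """Return the contents of an HTML file as a single string.

    Hypothetical helper; UTF-8 is an assumed encoding.
    """
    return Path(path).read_text(encoding="utf-8")
```

With a file named, say, `table.html` (a made-up name), you would call `html = load_html("table.html")`.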

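from io import StringIO

import pandas

# Newer pandas releases no longer accept infer_types, so it is dropped here.
# thousands=None keeps values such as "1,2" as text; by default read_html
# treats "," as a thousands separator and would read that cell as 12.
tables = pandas.read_html(StringIO(html), header=0, thousands=None)

# read_html returns a list of DataFrames, one per table found in the HTML;
# we only have one here
table = tables[0]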

That made a DataFrame with your table.


Which you can now operate on, pandas style! In particular, you probably want to pull out cases and controls.

# Break out those cases and controls into a DataFrame
case_control_list = table["Cases,Controls"].str.split(",", n=1).tolist()
case_control = pandas.DataFrame(case_control_list, columns = ["Cases", "Controls"])

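From there, the corrected columns from the question can be added with boolean masks. The following is only a sketch: it builds the example table directly so that it runs on its own, the column names "Corrected Cases" and "Corrected Controls" are made up here, and it assumes that any value the question does not mention (cases for sun, controls for snow) is carried over unchanged.

```python
import pandas as pd

# Same rows as the example table, built directly so the sketch is self-contained
table = pd.DataFrame({
    "Sample": ["snow", "sun", "sun", "snow"],
    "Gene": ["NGF", "NGF", "NGF", "NGF"],
    "Cases,Controls": ["1,2", "2,3", "1,0", "1,3"],
})

# expand=True returns the two halves of each string as separate columns
split = table["Cases,Controls"].str.split(",", n=1, expand=True)
table["Cases"] = split[0].astype(int)
table["Controls"] = split[1].astype(int)

# Assumption: start from the raw counts, then scale only the rows
# that the question names
table["Corrected Cases"] = table["Cases"].astype(float)
table["Corrected Controls"] = table["Controls"].astype(float)

is_snow = table["Sample"] == "snow"
is_sun = table["Sample"] == "sun"
table.loc[is_snow, "Corrected Cases"] = table.loc[is_snow, "Cases"] * 0.8
table.loc[is_sun, "Corrected Controls"] = table.loc[is_sun, "Controls"] * 1.5

# The combined column is no longer needed
table = table.drop(columns=["Cases,Controls"])
print(table)
```

Masks with `.loc` keep the rule for each sample on its own line, so adding another sample type later only takes one more mask and one more assignment.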

Upvotes: 1
