Reputation: 61
I need to recode some haplotypes. They are in a pandas DataFrame of 305 rows and 129902 columns, which looks like this (only one column and 20 rows shown):
rs# rs12914615
SNPalleles C/T
chrom chr15
pos 98259206
strand +
genome_build ncbi_B36
center affymetrix
protLSID urn:LSID:affymetrix.hapmap.org:Protocol:Genome...
assayLSID urn:LSID:affymetrix.hapmap.org:Assay:SNP_A-837...
panelLSID urn:lsid:dcc.hapmap.org:Panel:CEPH-30-trios:1
QC_code QC+
NA06985 CT
NA06991 CT
NA06993 CT
NA06993.dup CC
NA06994 CC
NA07000 CC
NA07019 CT
NA07022 CT
The idea is to check, for each individual (NA06...), whether both nucleotides match the wildtype allele (the first letter of the SNPalleles row) and, if not, to code the genotype accordingly.
My problem is that I don't know how to iterate over the DataFrame while referring to the wildtype, which is in another row of the same column.
The output should look something like this:
NA06985 1
NA06991 1
NA06993 1
NA06993.dup 0
NA06994 0
NA07000 0
NA07019 1
NA07022 1
Here 0 is the wildtype homozygote (CC for this gene), 1 the heterozygote (CT) and 2 the mutant homozygote (TT).
Thanks for the help.
Upvotes: 1
Views: 164
Reputation: 294288
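# Keep only the individual rows (labels containing 'NA'), then compare each
# genotype with the allele pair from SNPalleles ('C/T' -> 'CT'), column by column.
df.filter(
    like='NA', axis=0
).eq(df.loc['SNPalleles'].str.replace('/', '')).astype(int)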
rs12914615
rs#
NA06985 1
NA06991 1
NA06993 1
NA06993.dup 0
NA06994 0
NA07000 0
NA07019 1
NA07022 1
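Note that the comparison above only separates the heterozygote from everything else, so TT would also come out as 0. If the 0/1/2 coding from the question is needed, one possible sketch is to count the non-wildtype alleles in each genotype. The small frame below is made up for illustration (the individual NA99999 and the column rs_example are hypothetical, not from the original data), and it assumes every genotype is exactly two letters:

```python
import pandas as pd

# Hypothetical miniature of the frame in the question: metadata rows on top,
# one row per individual below, one column per SNP.
df = pd.DataFrame(
    {
        "rs12914615": ["C/T", "chr15", "CT", "CC", "TT"],
        "rs_example": ["A/G", "chr1", "GA", "AA", "AG"],  # made-up SNP
    },
    index=["SNPalleles", "chrom", "NA06985", "NA06994", "NA99999"],
)
df.index.name = "rs#"

# Wildtype allele per SNP: the first letter of the SNPalleles row.
wildtype = df.loc["SNPalleles"].str[0]

# Individual rows only (labels containing "NA").
genotypes = df.filter(like="NA", axis=0)

# Split each two-letter genotype into its first and second allele.
first = genotypes.apply(lambda col: col.str[0])
second = genotypes.apply(lambda col: col.str[1])

# Each allele that differs from the column's wildtype adds 1, so the result is
# 0 (wildtype homozygote), 1 (heterozygote) or 2 (mutant homozygote).
coded = first.ne(wildtype).astype(int) + second.ne(wildtype).astype(int)

print(coded)
```

Because it counts alleles rather than matching the whole string, this treats CT and TC the same way. Anything that is not the wildtype letter is counted, so missing calls (if the file marks them with a placeholder such as NN) would need to be masked before recoding.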
Upvotes: 1