Reputation: 5759
I know I have read a way to do what I am looking for but I can't seem to find it.
I have a pandas DataFrame that looks something like this:
Chrom Loc WT Var Change ConvChange AO DP VAF \
0 chr1 115227855 T A T>A T>A 5 19346 0.000258451
IntEx Gene Upstream Downstream Individual
0 TIII TIIIa T C 1
I would like to understand across Individual
where the greatest differences in VAF
exist at each unique Chrom
, Loc
and Change
combination.
I was thinking of changing the current DataFrame so it looks something like below to give me standard deviation column that could be sorted to give me the locations with the greatest differences. Is this a good approach and is there a good way to do something like this?
1 2 3 Mean Std
chr1-115227855-T>A 0.000258451 0.000548128 0.000789456 0.000532011 0.0002170812
Upvotes: 0
Views: 52
Reputation: 153460
You can use some Pandas reshaping:
MCVE:
Given:
print(df)
Chrom Individual VAF Var WT
0 chr1 1 0.076397 A T
1 chr1 2 0.964344 A T
2 chr1 3 0.563713 A T
Reshape and aggregate:
df.set_index(['Chrom','WT','Var','Individual'])['VAF'].unstack(-1)\
.pipe(lambda x: x.assign(mean=x.mean(1), std=x.std(1)))
Output:
Individual 1 2 3 mean std
Chrom WT Var
chr1 T A 0.076397 0.964344 0.563713 0.534818 0.444678
Upvotes: 1