Reputation: 467
I have a dataframe that that looks like this:
chr | start | end | plus | minus | total | in_control | sites_in_cluster | mean | cluster |
---|---|---|---|---|---|---|---|---|---|
1 | 1000 | 1005 | 6 | 7 | 13 | Y | 3 | 6 | 36346 |
1 | 1007 | 10012 | 3 | 1 | 4 | N | 3 | 6 | 36346 |
1 | 10014 | 10020 | 0 | 1 | 1 | Y | 3 | 6 | 36346 |
2 | 33532 | 33554 | 1 | 1 | 2 | N | 1 | 2 | 22123 |
in_control==Y
)I want to create an additional column, which tells me what proportion of the sites are in the control. i.e. (sum(in_control==Y) for a cluster)/sites_in_cluster
In this example, we have two rows with in_control==Y
and 3 sites_in_cluster
in cluster 36346. Therefore, cluster_sites_in_control would be 2/3 = 0.66
whereas cluster 22123 only has one site and isn't in the control, so would be 0/1=0
chr | start | end | plus | minus | total | in_control | sites_in_cluster | mean | cluster | cluster_sites_in_control |
---|---|---|---|---|---|---|---|---|---|---|
1 | 1000 | 1005 | 6 | 7 | 13 | Y | 3 | 6 | 36346 | 0.66 |
1 | 1007 | 10012 | 3 | 1 | 4 | N | 3 | 6 | 36346 | 0.66 |
1 | 10014 | 10020 | 0 | 1 | 1 | Y | 3 | 6 | 36346 | 0.66 |
2 | 33532 | 33554 | 1 | 1 | 2 | N | 1 | 2 | 22123 | 0.00 |
I have created code which seemingly accomplishes this, however, it seems to be extremely roundabout and I'm certain there's a better solution out there:
intersect_in_control
# %%
import pandas as pd
#get the number of sites in a control that are 'Y'
number_in_control = pd.DataFrame(intersect_in_control.groupby(['cluster']).in_control.value_counts().unstack(fill_value=0).loc[:,'Y'])
#get the number of breaksites for that cluster
number_of_breaksites = pd.DataFrame(intersect_in_control.groupby(['cluster'])['sites_in_cluster'].count())
#combine these two dataframes
combined_dataframe = pd.concat([number_in_control.reset_index(drop=False), number_of_breaksites.reset_index(drop=True)], axis=1)
#calculate the desired column
combined_dataframe["proportion_in_control"] = combined_dataframe["Y"]/combined_dataframe["sites_in_cluster"]
#left join this new dataframe to the original whilst dropping undesired columns.
cluster_in_control = intersect_in_control.merge((combined_dataframe.drop(["Y","sites_in_cluster"], axis = 1)), on='cluster', how='left')
10 rows of the df as example data:
{'chr': {0: 'chr14',
1: 'chr2',
2: 'chr1',
3: 'chr10',
4: 'chr17',
5: 'chr17',
6: 'chr2',
7: 'chr2',
8: 'chr2',
9: 'chr1',
10: 'chr1'},
'start': {0: 23016497,
1: 133031338,
2: 64081726,
3: 28671025,
4: 45219225,
5: 45219225,
6: 133026750,
7: 133026761,
8: 133026769,
9: 1510391,
10: 15853061},
'end': {0: 23016501,
1: 133031342,
2: 64081732,
3: 28671030,
4: 45219234,
5: 45219234,
6: 133026755,
7: 133026763,
8: 133026770,
9: 1510395,
10: 15853067},
'plus_count': {0: 2,
1: 0,
2: 5,
3: 1,
4: 6,
5: 6,
6: 14,
7: 2,
8: 0,
9: 2,
10: 4},
'minus_count': {0: 6,
1: 7,
2: 1,
3: 5,
4: 0,
5: 0,
6: 0,
7: 0,
8: 2,
9: 3,
10: 1},
'count': {0: 8, 1: 7, 2: 6, 3: 6, 4: 6, 5: 6, 6: 14, 7: 2, 8: 2, 9: 5, 10: 5},
'in_control': {0: 'N',
1: 'N',
2: 'Y',
3: 'N',
4: 'Y',
5: 'Y',
6: 'N',
7: 'Y',
8: 'N',
9: 'Y',
10: 'Y'},
'total_breaks': {0: 8,
1: 7,
2: 6,
3: 6,
4: 6,
5: 6,
6: 18,
7: 18,
8: 18,
9: 5,
10: 5},
'sites_in_cluster': {0: 1,
1: 1,
2: 1,
3: 1,
4: 1,
5: 1,
6: 3,
7: 3,
8: 3,
9: 1,
10: 1},
'mean_breaks_per_site': {0: 8.0,
1: 7.0,
2: 6.0,
3: 6.0,
4: 6.0,
5: 6.0,
6: 6.0,
7: 6.0,
8: 6.0,
9: 5.0,
10: 5.0},
'cluster': {0: 22665,
1: 24664,
2: 3484,
3: 13818,
4: 23640,
5: 23640,
6: 24652,
7: 24652,
8: 24652,
9: 48,
10: 769}}
Thanks in advance for any help :)
Upvotes: 0
Views: 59
Reputation: 863116
For percentage is possible symplify solution with mean
per boolean column and for create new column use GroupBy.transform
, it working well because True
s apre processing like 1
:
df['cluster_sites_in_control'] = (df['in_control'].eq('Y')
.groupby(df['cluster']).transform('mean'))
print (df)
chr start end plus minus total in_control sites_in_cluster mean \
0 1 1000 1005 6 7 13 Y 3 6
1 1 1007 10012 3 1 4 N 3 6
2 1 10014 10020 0 1 1 Y 3 6
3 2 33532 33554 1 1 2 N 1 2
cluster cluster_sites_in_control
0 36346 0.666667
1 36346 0.666667
2 36346 0.666667
3 22123 0.000000
Upvotes: 3