Reputation: 159
I want to analyse pairwise genome comparisons stored in a pandas dataframe:
Genome1 Genome2 SNPs Study1 Study2 Town1 Town2
A B 3 s1 s2 t1 t2
A C 6 s1 s3 t1 t2
A D 7 s1 s3 t1 t2
A E 8 s1 s4 t1 t3
A F 3 s1 s4 t1 t3
Using groupby, how would I get the SNPs for when Study1 is the same as Study2 versus when they differ, and likewise for when Town1 is the same as Town2 versus when they differ?
Upvotes: 1
Views: 74
Reputation: 24281
I think this is what you want. Here we construct an example:
import pandas as pd
text = """Genome1 Genome2 SNPs Study1 Study2 Town1 Town2
A B 3 s2 s2 t1 t1
A C 6 s1 s3 t1 t1
A D 7 s1 s3 t1 t2
A E 8 s1 s4 t1 t3
A F 3 s1 s4 t1 t3
A F 2 s4 s4 t1 t3
A G 5 s1 s1 t1 t3"""
text2 = [line.split() for line in text.split('\n')]
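df = pd.DataFrame(text2[1:], columns=text2[0])
# Every value parsed from the text is a string; convert SNPs to integers
df['SNPs'] = df['SNPs'].astype(int)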
Given such a dataframe, we group by whether the studies match and whether the towns match, then collect the SNPs for each combination of same_study / same_town into a list:
# Creating boolean columns for grouping data
df['same_study'] = df['Study1'] == df['Study2']
df['same_town'] = df['Town1'] == df['Town2']
# Creating lists of values in each group
snps_df = df.groupby(['same_study', 'same_town'])['SNPs'].agg(list).reset_index()
>>> print(snps_df)
same_study same_town SNPs
0 False False [7, 8, 3]
1 False True [6]
2 True False [2, 5]
3 True True [3]
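If the study and town comparisons are wanted separately rather than combined, one option is to group on each flag by itself. The following is only a sketch under assumptions: it rebuilds the same example data with SNPs already stored as integers, and the choice of count and mean as summaries is illustrative, not something the question specifies.

```python
import pandas as pd

# Same example rows as above, with SNPs as integers (assumed for this sketch)
df = pd.DataFrame({
    "SNPs": [3, 6, 7, 8, 3, 2, 5],
    "Study1": ["s2", "s1", "s1", "s1", "s1", "s4", "s1"],
    "Study2": ["s2", "s3", "s3", "s4", "s4", "s4", "s1"],
    "Town1": ["t1"] * 7,
    "Town2": ["t1", "t1", "t2", "t3", "t3", "t3", "t3"],
})

# Boolean Series used as grouping keys; the names become the index names
same_study = (df["Study1"] == df["Study2"]).rename("same_study")
same_town = (df["Town1"] == df["Town2"]).rename("same_town")

# Summarise SNPs for same vs different study, ignoring town
by_study = df.groupby(same_study)["SNPs"].agg(["count", "mean"])

# Summarise SNPs for same vs different town, ignoring study
by_town = df.groupby(same_town)["SNPs"].agg(["count", "mean"])

print(by_study)
print(by_town)
```

Swapping agg(["count", "mean"]) for agg(list) gives the raw SNP values per group instead of summaries.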
Upvotes: 1