Pandas groupby whether columns are the same or different

Question

I want to analyse pairwise comparisons (genomes) stored in a pandas dataframe:

Genome1 Genome2 SNPs Study1 Study2 Town1 Town2
      A       B    3     s1     s2    t1    t2
      A       C    6     s1     s3    t1    t2
      A       D    7     s1     s3    t1    t2
      A       E    8     s1     s4    t1    t3
      A       F    3     s1     s4    t1    t3

Using groupby, how would I get the SNPs for pairs where Study1 is the same as Study2 versus pairs where they differ, and likewise for pairs where Town1 is the same as Town2 versus pairs where they differ?

iacob · Accepted Answer

I think this is what you want. Here we construct an example:

import pandas as pd

text = """Genome1 Genome2 SNPs Study1 Study2 Town1 Town2
      A       B    3     s2     s2    t1    t1
      A       C    6     s1     s3    t1    t1
      A       D    7     s1     s3    t1    t2
      A       E    8     s1     s4    t1    t3
      A       F    3     s1     s4    t1    t3
      A       F    2     s4     s4    t1    t3
      A       G    5     s1     s1    t1    t3"""

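# Split the text into rows of whitespace-separated fields
text2 = [line.split() for line in text.split('\n')]
df = pd.DataFrame(text2[1:], columns=text2[0])
# Fields parsed this way are strings, so convert SNPs to integers
df['SNPs'] = df['SNPs'].astype(int)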
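
If it helps, here is an alternative sketch for building the same example table. It assumes the data is available as whitespace-separated text and lets pd.read_csv parse it from an io.StringIO buffer, which infers numeric dtypes so SNPs comes out as integers. The name df_alt is only for illustration.

```python
import io

import pandas as pd

text = """Genome1 Genome2 SNPs Study1 Study2 Town1 Town2
      A       B    3     s2     s2    t1    t1
      A       C    6     s1     s3    t1    t1
      A       D    7     s1     s3    t1    t2
      A       E    8     s1     s4    t1    t3
      A       F    3     s1     s4    t1    t3
      A       F    2     s4     s4    t1    t3
      A       G    5     s1     s1    t1    t3"""

# sep=r"\s+" treats any run of whitespace as a single delimiter
df_alt = pd.read_csv(io.StringIO(text), sep=r"\s+")

# SNPs is inferred as an integer column; the other columns stay as strings
print(df_alt.dtypes)
```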

Given such a dataframe, we group by whether the studies are the same and whether the towns are the same, then output a list of all SNPs for each combination of same_study / same_town:

# Creating boolean columns for grouping data
df['same_study'] = df['Study1'] == df['Study2']
df['same_town'] = df['Town1'] == df['Town2']

# Creating lists of values in each group
snps_df = df.groupby(['same_study', 'same_town'])['SNPs'].apply(lambda group_series: group_series.tolist()).reset_index()

>>> print(snps_df)

   same_study  same_town       SNPs
0       False      False  [7, 8, 3]
1       False       True        [6]
2        True      False     [2, 5]
3        True       True        [3]
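
If you would rather compare study and town separately than in combination, the same boolean columns can each be grouped on their own. The following is a minimal sketch on the example data above; summarising with count and mean is my assumption about what is useful here, and the names by_study and by_town are only illustrative.

```python
import pandas as pd

# Same example rows as above, with SNPs already as integers
df = pd.DataFrame({
    "SNPs":   [3, 6, 7, 8, 3, 2, 5],
    "Study1": ["s2", "s1", "s1", "s1", "s1", "s4", "s1"],
    "Study2": ["s2", "s3", "s3", "s4", "s4", "s4", "s1"],
    "Town1":  ["t1", "t1", "t1", "t1", "t1", "t1", "t1"],
    "Town2":  ["t1", "t1", "t2", "t3", "t3", "t3", "t3"],
})

# Boolean flags marking whether each pair shares a study or a town
df["same_study"] = df["Study1"] == df["Study2"]
df["same_town"] = df["Town1"] == df["Town2"]

# Summarise SNPs for same vs. different study, ignoring town
by_study = df.groupby("same_study")["SNPs"].agg(["count", "mean"]).reset_index()

# Summarise SNPs for same vs. different town, ignoring study
by_town = df.groupby("same_town")["SNPs"].agg(["count", "mean"]).reset_index()

print(by_study)
print(by_town)
```

Any other aggregation (for example "median" or list) can be passed to agg in the same way.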
