Reputation:
I have a dataframe as follows:
a=["003C", "003P1", "003P1", "003P1", "004C", "004P1", "004P2", "003C", "003P2", "003P1", "003C", "003P1", "003P2", "003C", "003P1", "004C", "004P2", "001C", "001P1"]
b=["chr18", "chr20", "chr8", "chr8", "chr11", "chr11", "chr11", "chr11", "chr11", "chr11", "chr1", "chr1", "chr1", "chr1", "chr1", "chr11", "chr11", "chr9", "chr9"]
c=[48399,145653,244695,244695,1163940,1163940,1163940,5986513,5986513,5986513,248650751,248650751,248650751,125895,125895,2587895,2587895,14587952,14587952]
d=["C", "G", "C", "C", "C", "C", "C", "G", "G", "G", "T", "T", "T", "T", "T", "C", "C", "T", "T"]
e=["A", "T", "A", "A", "G", "G", "G", "A", "A", "A", "A", "A", "A", "A", "A", "G", "G", "C", "C"]
#Make dataframe
df = pd.DataFrame({'Sample':a, 'CHROM':b, 'POS':c, 'REF':d, 'ALT':e})
df
Sample CHROM POS REF ALT
0 003C chr18 48399 C A
1 003P1 chr20 145653 G T
2 003P1 chr8 244695 C A
3 003P1 chr8 244695 C A
4 004C chr11 1163940 C G
5 004P1 chr11 1163940 C G
6 004P2 chr11 1163940 C G
7 003C chr11 5986513 G A
8 003P2 chr11 5986513 G A
9 003P1 chr11 5986513 G A
10 003C chr1 248650751 T A
11 003P1 chr1 248650751 T A
12 003P2 chr1 248650751 T A
13 003C chr1 125895 T A
14 003P1 chr1 125895 T A
15 004C chr11 2587895 C G
16 004P2 chr11 2587895 C G
17 001C chr9 14587952 T C
18 001P1 chr9 14587952 T C
I wanted to extract dataframe that matches 'CHROM' 'POS' 'REF' 'ALT' for df['Sample'] with C common with P1 or P2 or P1 & P2. For example 003C : has its corrsponding 003P1 or 003P2 with with all matching values 'CHROM' 'POS' 'REF' 'ALT' see index 7,8,9 and 13,14 and 10,11,12. I wanted to extract them all: I was to do perform this using following code:
c = ['CHROM', 'POS', 'REF', 'ALT', 'INT']
df[['INT','STR']] = df['Sample'].str.extract(r'(\d+)(.*)')
m = df['STR'].isin(['C', 'P1', 'P2'])
m1 = df['STR'].eq('C').groupby([*df[c].values.T]).transform('any')
m2 = df['STR'].mask(~m).groupby([*df[c].values.T]).transform('nunique').ge(2)
df = df[m & m1 & m2].sort_values('POS', ignore_index=True).drop(['INT', 'STR'], 1)
I want to find out the Sample
containing C
is derived from P1
or P2
or both. The expected output is :
Sample CHROM POS REF ALT DERIVE
003C chr18 48399 C A
003C chr1 125895 T A P1
003P1 chr1 125895 T A
003P1 chr20 145653 G T
003P1 chr8 244695 C A
003P1 chr8 244695 C A
004C chr11 1163940 C G BOTH
004P1 chr11 1163940 C G
004P2 chr11 1163940 C G
004C chr11 2587895 C G P2
004P2 chr11 2587895 C G
003C chr11 5986513 G A BOTH
003P2 chr11 5986513 G A
003P1 chr11 5986513 G A
001C chr9 14587952 T C P1
001P1 chr9 14587952 T C
003C chr1 248650751 T A BOTH
003P1 chr1 248650751 T A
003P2 chr1 248650751 T A
Upvotes: 1
Views: 119
Reputation: 862671
If per groups are unique STR
column values first convert not match by ['C', 'P1', 'P2']
to empty strings, join per groups and mapping, then extract only rows with C
by m1
and replace missing values to empty strings:
c = ['CHROM', 'POS', 'REF', 'ALT', 'INT']
df[['INT','STR']] = df['Sample'].str.extract(r'(\d+)(.*)')
#sorting for order C,P1,P2
df = df.sort_values('STR')
m = df['STR'].isin(['C', 'P1', 'P2'])
m1 = df['STR'].eq('C')
d = {'CP1': 'P1', 'CP2':'P2', 'CP1P2':'BOTH'}
df['DERIVE'] = (df.assign(STR=df['STR'].where(m,''))
.groupby(c)['STR']
.transform(''.join)
.map(d)
.where(m1, '')
.fillna(''))
df = df.sort_values('POS', ignore_index=True).drop(['INT', 'STR'], 1)
print (df)
Sample CHROM POS REF ALT DERIVE
0 003C chr18 48399 C A
1 003C chr1 125895 T A P1
2 003P1 chr1 125895 T A
3 003P1 chr20 145653 G T
4 003P1 chr8 244695 C A
5 003P1 chr8 244695 C A
6 004C chr11 1163940 C G BOTH
7 004P2 chr11 1163940 C G
8 004P1 chr11 1163940 C G
9 004C chr11 2587895 C G P2
10 004P2 chr11 2587895 C G
11 003P1 chr11 5986513 G A
12 003C chr11 5986513 G A BOTH
13 003P2 chr11 5986513 G A
14 001C chr9 14587952 T C P1
15 001P1 chr9 14587952 T C
16 003C chr1 248650751 T A BOTH
17 003P2 chr1 248650751 T A
18 003P1 chr1 248650751 T A
Upvotes: 0