user2110417
user2110417

Reputation:

How to assign tags to new column based on the different row values in the dataframe?

I have a dataframe as follows:

a=["003C", "003P1", "003P1", "003P1", "004C", "004P1", "004P2", "003C", "003P2", "003P1", "003C", "003P1", "003P2", "003C", "003P1", "004C", "004P2", "001C", "001P1"]
b=["chr18", "chr20", "chr8", "chr8", "chr11", "chr11", "chr11", "chr11", "chr11", "chr11", "chr1", "chr1", "chr1", "chr1", "chr1", "chr11", "chr11", "chr9", "chr9"]
c=[48399,145653,244695,244695,1163940,1163940,1163940,5986513,5986513,5986513,248650751,248650751,248650751,125895,125895,2587895,2587895,14587952,14587952]
d=["C", "G", "C", "C", "C", "C", "C", "G", "G", "G", "T", "T", "T", "T", "T", "C", "C", "T", "T"]
e=["A", "T", "A", "A", "G", "G", "G", "A", "A", "A", "A", "A", "A", "A", "A", "G", "G", "C", "C"]
#Make dataframe
df = pd.DataFrame({'Sample':a, 'CHROM':b, 'POS':c, 'REF':d, 'ALT':e})

df

    Sample  CHROM   POS         REF  ALT
0   003C    chr18   48399       C    A
1   003P1   chr20   145653      G    T
2   003P1   chr8    244695      C    A
3   003P1   chr8    244695      C    A
4   004C    chr11   1163940     C    G
5   004P1   chr11   1163940     C    G
6   004P2   chr11   1163940     C    G
7   003C    chr11   5986513     G    A
8   003P2   chr11   5986513     G    A
9   003P1   chr11   5986513     G    A
10  003C    chr1    248650751   T    A
11  003P1   chr1    248650751   T    A
12  003P2   chr1    248650751   T    A
13  003C    chr1    125895      T    A
14  003P1   chr1    125895      T    A
15  004C    chr11   2587895     C    G
16  004P2   chr11   2587895     C    G
17  001C    chr9    14587952    T   C
18  001P1   chr9    14587952    T   C

I wanted to extract dataframe that matches 'CHROM' 'POS' 'REF' 'ALT' for df['Sample'] with C common with P1 or P2 or P1 & P2. For example 003C : has its corrsponding 003P1 or 003P2 with with all matching values 'CHROM' 'POS' 'REF' 'ALT' see index 7,8,9 and 13,14 and 10,11,12. I wanted to extract them all: I was to do perform this using following code:

c = ['CHROM', 'POS', 'REF', 'ALT', 'INT']
df[['INT','STR']] = df['Sample'].str.extract(r'(\d+)(.*)')

m  = df['STR'].isin(['C', 'P1', 'P2'])
m1 = df['STR'].eq('C').groupby([*df[c].values.T]).transform('any')
m2 = df['STR'].mask(~m).groupby([*df[c].values.T]).transform('nunique').ge(2)

df = df[m & m1 & m2].sort_values('POS', ignore_index=True).drop(['INT', 'STR'], 1)

I want to find out the Sample containing C is derived from P1 or P2 or both. The expected output is :

Sample  CHROM   POS         REF ALT DERIVE
003C    chr18   48399       C   A   
003C    chr1    125895      T   A   P1
003P1   chr1    125895      T   A   
003P1   chr20   145653      G   T   
003P1   chr8    244695      C   A   
003P1   chr8    244695      C   A   
004C    chr11   1163940     C   G   BOTH
004P1   chr11   1163940     C   G   
004P2   chr11   1163940     C   G   
004C    chr11   2587895     C   G   P2
004P2   chr11   2587895     C   G   
003C    chr11   5986513     G   A   BOTH
003P2   chr11   5986513     G   A   
003P1   chr11   5986513     G   A   
001C    chr9    14587952    T   C     P1
001P1   chr9    14587952    T   C   
003C    chr1    248650751   T   A   BOTH
003P1   chr1    248650751   T   A   
003P2   chr1    248650751   T   A   

Upvotes: 1

Views: 119

Answers (1)

jezrael
jezrael

Reputation: 862671

If per groups are unique STR column values first convert not match by ['C', 'P1', 'P2'] to empty strings, join per groups and mapping, then extract only rows with C by m1 and replace missing values to empty strings:

c = ['CHROM', 'POS', 'REF', 'ALT', 'INT']
df[['INT','STR']] = df['Sample'].str.extract(r'(\d+)(.*)')
#sorting for order C,P1,P2
df = df.sort_values('STR')

m  = df['STR'].isin(['C', 'P1', 'P2'])
m1 = df['STR'].eq('C')

d = {'CP1': 'P1', 'CP2':'P2', 'CP1P2':'BOTH'}
df['DERIVE'] = (df.assign(STR=df['STR'].where(m,''))
                  .groupby(c)['STR']
                  .transform(''.join)
                  .map(d)
                  .where(m1, '')
                  .fillna(''))

df = df.sort_values('POS', ignore_index=True).drop(['INT', 'STR'], 1)

print (df)
   Sample  CHROM        POS REF ALT DERIVE
0    003C  chr18      48399   C   A       
1    003C   chr1     125895   T   A     P1
2   003P1   chr1     125895   T   A       
3   003P1  chr20     145653   G   T       
4   003P1   chr8     244695   C   A       
5   003P1   chr8     244695   C   A       
6    004C  chr11    1163940   C   G   BOTH
7   004P2  chr11    1163940   C   G       
8   004P1  chr11    1163940   C   G       
9    004C  chr11    2587895   C   G     P2
10  004P2  chr11    2587895   C   G       
11  003P1  chr11    5986513   G   A       
12   003C  chr11    5986513   G   A   BOTH
13  003P2  chr11    5986513   G   A       
14   001C   chr9   14587952   T   C     P1
15  001P1   chr9   14587952   T   C       
16   003C   chr1  248650751   T   A   BOTH
17  003P2   chr1  248650751   T   A       
18  003P1   chr1  248650751   T   A     

Upvotes: 0

Related Questions