Laura Chipman
Laura Chipman

Reputation: 45

Python - trimming columns after a specific character or sequence

I've had trouble to find the particular python command for my scenario.

I want to merge datasets based on gene names. (ex/F44E5.4) For most the gene names, there is a 100% match so no problem. For other genes there are variants that are designated with either .#, a lower case vowel or both at the end of the gene name base. (ex/ F26D10.3.2, K01G5.8b, F52D10.3a.2).

I would like to trim those extra bits of information in order to merge with my other data set.

Also in the case where there is variants I would like to choose the higher letter or # variant. (i.e./ a over b, or .1 over .2)

df1
    gene_name    v_1
1    F44E5.4      1
2    F26D10.3.2   2
3    K01G5.8b     3
4    F52D10.3a.2  2 
5    K52GGG.1.1   2 
6    K52GGG.1.2   4 

df2
    gene_name    v_2
1    F44E5.4      .4
2    F26D10.3     .4
3    K01G5.8      .6
4    F52D10.3a    .7 
5    K52GGG.1     .4


merge
    gene_name    v_1     v_2
1    F44E5.4      1       .4
2    F26D10.3     2       .4
3    K01G5.8      3       .6
4    F52D10.3     2       .7 
5    K52GGG.1     2       .4

Upvotes: 0

Views: 54

Answers (1)

MaxU - stand with Ukraine
MaxU - stand with Ukraine

Reputation: 210922

you can use this approach:

In [25]: a.drop('gene_name',1) \
    ...:  .groupby(a.gene_name.str.extract(r'([^\.]*\.\d*)', expand=False)) \
    ...:  .max() \
    ...:  .join(b.drop('gene_name',1)
    ...:         .groupby(b.gene_name.str.extract(r'([^\.]*\.\d*)', expand=False))
    ...:         .max()) \
    ...:  .reset_index()
    ...:
Out[25]:
  gene_name  v_1  v_2
0  F26D10.3    2  0.4
1   F44E5.4    1  0.4
2  F52D10.3    2  0.7
3   K01G5.8    3  0.6
4  K52GGG.1    4  0.4

Or using pd.merge() method:

In [26]: pd.merge(
    ...:     a.drop('gene_name',1)
    ...:      .groupby(a.gene_name.str.extract(r'([^\.]*\.\d*)', expand=False))
    ...:      .max(),
    ...:     b.drop('gene_name',1)
    ...:      .groupby(a.gene_name.str.extract(r'([^\.]*\.\d*)', expand=False))
    ...:      .max(),
    ...:     left_index=True,
    ...:     right_index=True
    ...: ).reset_index()
    ...:
Out[26]:
  gene_name  v_1  v_2
0  F26D10.3    2  0.4
1   F44E5.4    1  0.4
2  F52D10.3    2  0.7
3   K01G5.8    3  0.6
4  K52GGG.1    4  0.4

PS I used a and b instead of df1 and df2 to make it bit shorter...

Upvotes: 1

Related Questions