Reputation: 45
I've had trouble to find the particular python command for my scenario.
I want to merge datasets based on gene names. (ex/F44E5.4) For most the gene names, there is a 100% match so no problem. For other genes there are variants that are designated with either .#, a lower case vowel or both at the end of the gene name base. (ex/ F26D10.3.2, K01G5.8b, F52D10.3a.2).
I would like to trim those extra bits of information in order to merge with my other data set.
Also in the case where there is variants I would like to choose the higher letter or # variant. (i.e./ a over b, or .1 over .2)
df1
gene_name v_1
1 F44E5.4 1
2 F26D10.3.2 2
3 K01G5.8b 3
4 F52D10.3a.2 2
5 K52GGG.1.1 2
6 K52GGG.1.2 4
df2
gene_name v_2
1 F44E5.4 .4
2 F26D10.3 .4
3 K01G5.8 .6
4 F52D10.3a .7
5 K52GGG.1 .4
merge
gene_name v_1 v_2
1 F44E5.4 1 .4
2 F26D10.3 2 .4
3 K01G5.8 3 .6
4 F52D10.3 2 .7
5 K52GGG.1 2 .4
Upvotes: 0
Views: 54
Reputation: 210922
you can use this approach:
In [25]: a.drop('gene_name',1) \
...: .groupby(a.gene_name.str.extract(r'([^\.]*\.\d*)', expand=False)) \
...: .max() \
...: .join(b.drop('gene_name',1)
...: .groupby(b.gene_name.str.extract(r'([^\.]*\.\d*)', expand=False))
...: .max()) \
...: .reset_index()
...:
Out[25]:
gene_name v_1 v_2
0 F26D10.3 2 0.4
1 F44E5.4 1 0.4
2 F52D10.3 2 0.7
3 K01G5.8 3 0.6
4 K52GGG.1 4 0.4
Or using pd.merge()
method:
In [26]: pd.merge(
...: a.drop('gene_name',1)
...: .groupby(a.gene_name.str.extract(r'([^\.]*\.\d*)', expand=False))
...: .max(),
...: b.drop('gene_name',1)
...: .groupby(a.gene_name.str.extract(r'([^\.]*\.\d*)', expand=False))
...: .max(),
...: left_index=True,
...: right_index=True
...: ).reset_index()
...:
Out[26]:
gene_name v_1 v_2
0 F26D10.3 2 0.4
1 F44E5.4 1 0.4
2 F52D10.3 2 0.7
3 K01G5.8 3 0.6
4 K52GGG.1 4 0.4
PS I used a
and b
instead of df1
and df2
to make it bit shorter...
Upvotes: 1