chippycentra
chippycentra

Reputation: 879

Add new rows in a df depending on another df and conditions with pandas

I have some issues in order to create a new data frame depending on 2 dataframes informations. Here is a dataframe1:

species seq_names   value
dog     seq_C       0.67
cat     seq_F       1.4
cat     seq_E       0.4
dolphin seq_F       0.7
dolphin seq_A       1.9
frog    seq_A       0.8
frog    seq_B       0.40

Here is another dataframe2 :

group_number    col1
1               cat
1               dog
2               dolphin
2               frog
2               seq_X
2               seq_Y

As you can see there are 2 groups. And the idea is to add to these groups the seq_names depending on the species they matches in the df1 and their values.

Here I should get a new_df with the seq_names added:

group_number    sp_seq_names
1               cat
1               dog
1               seq_C
1               seq_F
1               seq_E   
2               dolphin
2               frog
2               seq_A
2               seq_B
2               seq_X
2               seq_Y

Edit for IMC : as you can see seq_X and seq_Y are still present at the end.

The group_number 1 gained the seq_names C,F and E because all where matching with at least one species in the group. But the subtlety is here: As you can also see group1 gained the seq_name F but not the group_number 2 despit the fact that dolphin was also matching this sequence, but the cat had a better value than dolphin with this seq_name (1.4 against 0.7.) So the group_number 2 only gained 2 seq_names A and B.

Does someone have an idea using pandas?

I tried something with a merge:

pd.merge(df2, df1, left_on=['col1'],right_on=['species'],how='outer')

   group_number     col1  species seq_names  value
0             1      cat      cat     seq_F   1.40
1             1      cat      cat     seq_E   0.40
2             1      dog      dog     seq_C   0.67
3             2  dolphin  dolphin     seq_F   0.70
4             2  dolphin  dolphin     seq_A   1.80
5             2     frog     frog     seq_B   0.40
6             2     frog     frog     seq_A   0.80

and then I created the desired df :

df=[]
for species, group  in  zip (df_new['seq_names'],df_new['group_number']):
 df.append({'groups':group,'sp_seq_names':species})
for species, group  in  zip (df_new['species'],df_new['group_number']):
 df.append({'groups':group,'sp_seq_names':species})

and I get :

>>> pd.DataFrame(df)
       col1  groups
0       cat       1
1       cat       1
2       dog       1
3   dolphin       2
4   dolphin       2
5      frog       2
6      frog       2
7     seq_F       1
8     seq_E       1
9     seq_C       1
10    seq_F       2
11    seq_A       2
12    seq_B       2
13    seq_A       2

But as you can see I cannot manage to take the sharing seq_names between groups and decide which group gain this seq_name according to the value.

Upvotes: 2

Views: 57

Answers (1)

IMCoins
IMCoins

Reputation: 3306

import pandas as pd

df = pd.read_csv('test')
df2 = pd.read_csv('test.csv')
df2 = df2.rename(columns={'col1' : 'species'})
print(df)
#    species seq_names  value
# 0      dog     seq_C   0.67
# 1      cat     seq_F   1.40
# 2      cat     seq_E   0.40
# 3  dolphin     seq_F   0.70
# 4  dolphin     seq_A   1.90
# 5     frog     seq_A   0.80
# 6     frog     seq_B   0.40

print(df2)
#    group_number  species
# 0             1      cat
# 1             1      dog
# 2             2  dolphin
# 3             2     frog

#   We now don't immediatly drop the duplicates, we want to save the merge before.
#   Doing this, we're able to keep the seq_names associated with their group_number.
ndf = df.merge(df2, on='species')\
        .sort_values(by='value', ascending=False)

#   I make a copy so that I get a whole new DataFrame.
#   If I didn't. Changes made to seq_groups would have affected the original.
seq_groups_df = ndf[['seq_names', 'group_number']].copy()
seq_groups_df = seq_groups_df.rename(columns={'seq_names' : 'sp_seq_names'})
print(seq_groups_df)
#   seq_names  group_number
# 4     seq_A             2
# 1     seq_F             1
# 5     seq_A             2
# 3     seq_F             2
# 0     seq_C             1
# 2     seq_E             1
# 6     seq_B             2

ndf = ndf.drop_duplicates(subset='seq_names', keep='first')

#   Either select the interesting columns.
ndf = ndf[['group_number', 'species']]
ndf = ndf.rename(columns={'species' : 'sp_seq_names'})

print(ndf)
#    group_number sp_seq_names
# 4             2      dolphin
# 1             1          cat
# 0             1          dog
# 2             1          cat
# 6             2         frog


result_df = ndf.append(seq_groups_df).reset_index(drop=True)
print(result_df)
#     group_number sp_seq_names
# 0              2      dolphin
# 1              1          cat
# 2              1          dog
# 3              1          cat
# 4              2         frog
# 5              2        seq_A
# 6              1        seq_F
# 7              2        seq_A
# 8              2        seq_F
# 9              1        seq_C
# 10             1        seq_E
# 11             2        seq_B

Upvotes: 2

Related Questions