Reputation: 305
I have 2 dataframes and I would like to use a loop to do the following.
For example: If analyzing Clust 1, in df, only the index, G-pairs, G_A, G_B, and Clust 1 should remain. In df2 only Clust _1_n & clust_1_l should remain.
Go through every row of df and df2, if there is a ID match between G_A in df and clust_1_n in df2, then add a new column to df titled 'G_A_val' and enter the associated Clust_1_l Value
Go through every row of df and df2, if there is a ID match between G_B in df and clust_1_n in df2, then add a new column to df titled 'G_B_val' and enter the associated Clust_1_l Value
Add a new column to df titled sums which contains the sum of G_A_val & G_B_val for each row
Lastly, apply a condition such that if a row in the Clust 1 column is >= 0.051 then replace the sums column value for that row in df with 0
** I can perform these tasks without loops but I'm trying to learn how to use loops in dataframes.
These are the dataframes
import pandas as pd
# initialise data of lists.
data = {'g_pairs':['an_jk', 'tf_ha', 'ab_rh', 'et_x2','yr_po'],
'g_a':['en762','en72b','en925','en980','en009'],
'g_b':['en361','en231','en666','en771','en909'],
'clust1':[0.020,1,0.05,0.7,0.001],
'clust2':[1,1,0.1,0.0001,1],
'clust3':[0.001,0.002,0.0021,0.3,0.005],
'clust4':[0.0002,0.0043,0.0067,0.0123,0.0110],
'clust5':[0.7500,0.0540,0.0030,1,0.0081]}
# Creates pandas DataFrame.
df = pd.DataFrame(data)
# print the data
df
g_pairs g_a g_b clust1 clust2 clust3 clust4 clust5
0 an_jk en762 en361 0.020 1.0000 0.0010 0.0002 0.7500
1 tf_ha en72b en231 1.000 1.0000 0.0020 0.0043 0.0540
2 ab_rh en925 en666 0.050 0.1000 0.0021 0.0067 0.0030
3 et_x2 en980 en771 0.700 0.0001 0.3000 0.0123 1.0000
4 yr_po en009 en909 0.001 1.0000 0.0050 0.0110 0.0081
data2 = {'clust_1n':['en762','en72b','en925','en980','en009','en361','en231','en666','en771','en909'],
'clust_1l':[0.35,0.30,0.004,0.58,0.55,0.94,0.78,0.62,0.366,0.01],
'clust_2n':['en762','en72b','en925','en980','en009','en361','en231','en666','en771','en909'],
'clust_2l':[0.37,0.0130,0.385,0.57,0.6,0.19,0.41,0.92,0.46,0.44],
'clust_3n':['en762','en72b','en925','en980','en009','en361','en231','en666','en771','en909'],
'clust_3l':[0.29,0.01130,0.2385,0.457,0.16,0.149,0.411,0.992,0.146,0.944],
'clust_4n':['en762','en72b','en925','en980','en009','en361','en231','en666','en771','en909'],
'clust_4l':[0.529,0.0001130,0.12385,0.6457,0.816,0.9149,0.3411,0.0992,0.3146,0.0944],
'clust_5n':['en762','en72b','en925','en980','en009','en361','en231','en666','en771','en909'],
'clust_5l':[0.229,0.001130,0.12385,0.4457,0.616,0.1549,0.4911,0.0992,0.9146,0.2944]}
# Creates pandas DataFrame.
df2 = pd.DataFrame(data2)
# print the data
df2
clust_1n clust_1l clust_2n clust_2l clust_3n clust_3l clust_4n clust_4l clust_5n clust_5l
0 en762 0.350 en762 0.370 en762 0.2900 en762 0.529000 en762 0.22900
1 en72b 0.300 en72b 0.013 en72b 0.0113 en72b 0.000113 en72b 0.00113
2 en925 0.004 en925 0.385 en925 0.2385 en925 0.123850 en925 0.12385
3 en980 0.580 en980 0.570 en980 0.4570 en980 0.645700 en980 0.44570
4 en009 0.550 en009 0.600 en009 0.1600 en009 0.816000 en009 0.61600
5 en361 0.940 en361 0.190 en361 0.1490 en361 0.914900 en361 0.15490
6 en231 0.780 en231 0.410 en231 0.4110 en231 0.341100 en231 0.49110
7 en666 0.620 en666 0.920 en666 0.9920 en666 0.099200 en666 0.09920
8 en771 0.366 en771 0.460 en771 0.1460 en771 0.314600 en771 0.91460
9 en909 0.010 en909 0.440 en909 0.9440 en909 0.094400 en909 0.29440
This is what the output should look like
g_pairs g_a g_b clust1 g_a_val g_b_val sums
0 an_jk en762 en361 0.020 0.350 0.940 1.29
1 tf_ha en72b en231 1.000 0.300 0.780 0.00
2 ab_rh en925 en666 0.050 0.004 0.620 0.97
3 et_x2 en980 en771 0.700 0.580 0.366 0.00
4 yr_po en009 en909 0.001 0.550 0.010 0.56
Thank you very much in advance.
Upvotes: 0
Views: 140
Reputation: 11321
EDIT: Sorry, I have missed
"I can perform these tasks without loops but I'm trying to learn how to use loops in dataframes."
Does this mean my first answer below is useless? :)) You shouldn't loop over DataFrames if you can avoid it! But if you really want to do it, this would be one way:
cluster = 1
df_res = df[["g_pairs", "g_a", "g_b", f"clust{cluster}"]].assign(g_a_val=0, g_b_val=0, sum=0)
for i, g_a, g_b, clust in df[["g_a", "g_b", f"clust{cluster}"]].itertuples():
for clust_n, clust_l in df2[[f"clust_{cluster}n", f"clust_{cluster}l"]].itertuples(index=False):
if g_a == clust_n:
g_a_val = clust_l
df_res.loc[i, "g_a_val"] = g_a_val
break
for clust_n, clust_l in df2[[f"clust_{cluster}n", f"clust_{cluster}l"]].itertuples(index=False):
if g_b == clust_n:
g_b_val = clust_l
df_res.loc[i, "g_b_val"] = g_b_val
break
if clust < 0.51:
df_res.loc[i, "sum"] = g_a_val + g_b_val
You could do the following:
cluster = 1
df_res = (
df[["g_pairs", "g_a", "g_b", f"clust{cluster}"]]
.merge(
df2[[f"clust_{cluster}n", f"clust_{cluster}l"]],
left_on="g_a",
right_on=f"clust_{cluster}n",
how="left"
)
.drop(columns=f"clust_{cluster}n")
.rename(columns={f"clust_{cluster}l": "g_a_val"})
.merge(
df2[[f"clust_{cluster}n", f"clust_{cluster}l"]],
left_on="g_b",
right_on=f"clust_{cluster}n",
how="left"
)
.drop(columns=f"clust_{cluster}n")
.rename(columns={f"clust_{cluster}l": "g_b_val"})
)
df_res["sum"] = df_res.g_a_val + df_res.g_b_val
df_res.loc[df[f"clust{cluster}"] >= 0.51, "sum"] = 0
Result:
g_pairs g_a g_b clust1 g_a_val g_b_val sum
0 an_jk en762 en361 0.020 0.350 0.940 1.290
1 tf_ha en72b en231 1.000 0.300 0.780 0.000
2 ab_rh en925 en666 0.050 0.004 0.620 0.624
3 et_x2 en980 en771 0.700 0.580 0.366 0.000
4 yr_po en009 en909 0.001 0.550 0.010 0.560
Upvotes: 1