Reputation: 31
I have a table (df1) with a list of values (neig_list, which is a python list) in each row
ID | neig_list |
---|---|
1 | a, b, d |
2 | b, e, f, g, h |
3 | b, a, j, k |
And a table (df2) with entries for those values
neig | samples | samples_indicator |
---|---|---|
'a' | 3 | 0.5 |
'a' | 5 | 0.1 |
'b' | 1 | 0.2 |
'c' | 15 | 0.5 |
'd' | 12 | 0.3 |
'a' | 2 | 1 |
'e' | 5 | 0.6 |
'f' | 6 | 0 |
'h' | 6 | 0.5 |
I need to add a column to df1 getting, for each row, the result for the sum of samples x samples_indicator for all neigs that are contained in the neig_list for that row.
For example, for the first row, we would have:
3*0.5 + 5*0.1 + 1*0.2 + 12*0.3 + 2*1 = 7.8
ID | neig_list | new_column |
---|---|---|
1 | a, b, d | 7.8 |
2 | b, e, f, g, h | value |
3 | b, a, j, k | value |
Actually, the function is more complicated than that (involves more columns), so ideally I'd like to have a separate function and then apply it to df1, based on df2.
Upvotes: 0
Views: 1431
Reputation: 9379
Here's a way to do what your question asks:
def foo(df1, df2):
return (df1
.join(df1.assign(neig=df1.neig_list).explode('neig')
.join(
df2.assign(new_column=df2.samples * df2.samples_indicator)[['neig','new_column']].groupby('neig').sum(),
on='neig')
.drop(columns=['neig','neig_list']).groupby('ID').sum(),
on='ID')
)
print(foo(df1, df2))
Output:
ID neig_list new_column
0 1 [a, b, d] 7.8
1 2 [b, e, f, g, h] 6.2
2 3 [b, a, j, k] 4.2
Explanation:
assign()
to add new_column
as a column to df2
which, using groupby()
and sum()
, gets populated with the dot-product of samples
and samples_indicator
for the rows in each neig
groupassign()
to clone the neig_list
column of df1
as neig
and explode()
to expand each row to one row per item in the neig
columnjoin()
on the above two DataFrame objects to put sample results from new_column
into each row based on its neig
valuejoin()
again with the above DataFrame object (after dropping the neig
and neig_list
columns) to add the desired column to the original df1
.Upvotes: 1
Reputation: 1750
You can just define a function that performs the calculations for a given list of neig
s using df2
and then just apply it to neig_list
in df1
:
def result(row):
return sum([df2['samples'][item]*df2['samples_indicator'][item] for item in row])
df1['new_column'] = df1['neig_list'].apply(result)
Note that this requires neig
to be the index in df2. If it's not, you can do df2.set_index('neig', inplace=True)
or, if you don't want to modify d2:
def result(row):
return sum([df2.set_index('neig')['samples'][item]*df2.set_index('neig')['samples_indicator'][item] for item in row])
and apply it the same way as before.
Upvotes: 1
Reputation: 59274
Calculate first your math in df2
:
map_ = df2.assign(neig = df2['neig'].str.strip("'"),
calculated = lambda df: df['samples'] * df['samples_indicator'])\
.groupby('neig')['calculated'].sum()
Then, explode
your first df, and map the values above for 'a'
, 'b'
etc with the calculated formula. Finally, groupby
and sum
:
df['new_column'] = df['neig_list'].str.split(', ').explode()\
.map(map_)\
.groupby(level=0)\
.sum()
ID neig_list new_column
0 1 a, b, d 7.8
1 2 b, e, f, g, h 6.2
2 3 b, a, j, k 4.2
Upvotes: 1