Reputation: 615
I am currently trying to implement a statistical test for a specific row based on the content of different rows. Given the dataframe in the following image:
DataFrame I would like to create a new column based on a function that takes into account all the columns of the dataframe that has the same string in column "Template".
For example, in this case there are 2 rows with Template "[Are|Off]", and for each one of those rows I would need to create an element in a new column based on "Clicks", "Impressions" and "Conversions" of both rows.
How would you best approach this problem?
PS: I apologise in advance for the way I am describing the problem, as you might have notices I am not a professional codes :D But I would really appreciate your help!
Here the formula with which I solved this in excel:
Upvotes: 0
Views: 1776
Reputation: 184
This might be overly general but I would use some sort of function map if different things should be done depending on the template name:
import pandas as pd
import numpy as np
import collections
n = 5
template_column = list(['are|off', 'are|off', 'comp', 'comp', 'comp|city'])
n = len(template_column)
df = pd.DataFrame(np.random.random((n, 3)), index=range(n), columns=['Clicks', 'Impressions', 'Conversions'])
df['template'] = template_column
# Use a defaultdict so that you can define a default value if a template is
# note defined
function_map = collections.defaultdict(lambda: lambda df: np.nan)
# Now define functions to compute what the new columns should do depending on
# the template.
function_map.update({
'are|off': lambda df: df.sum().sum(),
'comp': lambda df: df.mean().mean(),
'something else': lambda df: df.mean().max()
})
# The lambda functions are just placeholders. You could do whatever you want in these functions... for example:
def do_special_stuff(df):
"""Do something that uses rows and columns...
you could also do looping or whatever you want as long
as the result is a scalar, or a sequence with the same
number of columns as the original template DataFrame
"""
crazy_stuff = np.prod(np.sum(df.values,axis=1)[:,None] + 2*df.values, axis=1)
return crazy_stuff
function_map['comp'] = do_special_stuff
def wrap(f):
"""Wrap a function so that it returns an updated dataframe"""
def wrapped(df):
df = df.copy()
new_column_data = f(df.drop('template', axis=1))
df['new_column'] = new_column_data
return df
return wrapped
# wrap all the functions so that each template has a function defined that does
# the correct thing
series_function_map = {k: wrap(function_map[k]) for k in df['template'].unique()}
# throw everything back together
new_df = pd.concat([series_function_map[label](group)
for label, group in df.groupby('template')],
ignore_index=True)
# print your shiny new dataframe
print(new_df)
The result is then something like:
Clicks Impressions Conversions template new_column
0 0.959765 0.111648 0.769329 are|off 4.030594
1 0.809917 0.696348 0.683587 are|off 4.030594
2 0.265642 0.656780 0.182373 comp 0.502015
3 0.753788 0.175305 0.978205 comp 0.502015
4 0.269434 0.966951 0.478056 comp|city NaN
Hope it helps!
Upvotes: 3
Reputation: 200
Ok so after groupby u need to apply this formula ..so you can do this in pandas also ...
import numpy as np
t = df.groupby("Template") # this is for groupby
def calculater(b5,b6,c5,c6):
return b5/(b5+b6)*((c5+c6))
t['result'] = np.vectorize(calculater)(df["b5"],df["b6"],df["c5"],df["c6"])
here b5,b6 .. are column names of the cells shown in image
This should work for you or may need to do some minor changes in maths there
Upvotes: 2