Reputation: 977
Imagine we have several Pandas dataframes with different column structures:
import pandas as pd

# creating the first dataframe
df1 = pd.DataFrame({
    "width": [1, 5],
    "height": [5, 8]})
# creating the second dataframe
df2 = pd.DataFrame({
    "a": [7, 8],
    "b": [11, 23],
    "c": [1, 3]})
# creating the third dataframe
df3 = pd.DataFrame({
    "radius": [7, 8],
    "height": [11, 23]})
In general there may be more than three dataframes. I want to create logic that maps column names to specific functions in order to compute a new column "metric" (think of it as an area for two columns and a volume for three columns). I want to specify the column name ensembles together with their functions:
# the functions must be defined before the dictionary refers to them
def area(width, height):
    return width * height

def volume_cube(a, b, c):
    return a * b * c

def volume_cylinder(radius, height):
    return (3.14159 * radius ** 2) * height

column_name_ensembles = {
    "1": {
        "ensemble": ['height', 'width'],
        "method": area},
    "2": {
        "ensemble": ['a', 'b', 'c'],
        "method": volume_cube},
    "3": {
        "ensemble": ['radius', 'height'],
        "method": volume_cylinder}}
Now, the area function should create a new column for the dataframe: df1['metric'] = df1['height'] * df1['width'], and the volume function should create a new column for the dataframe: df2['metric'] = df2['a'] * df2['b'] * df2['c']. Note that the functions can have arbitrary form, but each takes its ensemble as parameters. The desired function metric(df, column_name_ensembles) should take an arbitrary dataframe as input and decide, by inspecting the column names, which function should be applied.
Example input output behaviour
df1_with_metric = metric(df1, column_name_ensembles)
print(df1_with_metric)
# output
# width height metric
# 0 1 5 5
# 1 5 8 40
df2_with_metric = metric(df2, column_name_ensembles)
print(df2_with_metric)
# output
# a b c metric
# 0 7 11 1 77
# 1 8 23 3 552
df3_with_metric = metric(df3, column_name_ensembles)
print(df3_with_metric)
# output
# radius height metric
# 0 7 11 1693.31701
# 1 8 23 4624.42048
The perfect solution would be a function that takes the dataframe and the column_name_ensembles as parameters and returns the dataframe with the appropriate 'metric' added to it.
I know this can be achieved by multiple if and else statements, but this does not seem to be the most intelligent solution. Maybe there is a design pattern that can solve this problem, but I am not an expert at design patterns.
Thank you for reading my question! I am looking forward to your answers.
Upvotes: 3
Views: 1542
Reputation: 75090
Here is an interesting way of doing this using pandas methods (Details below)
def metric(dataframe, column_name_ensembles):
    func_df = pd.DataFrame(column_name_ensembles).T
    func_to_apply = func_df.loc[func_df['ensemble'].map(dataframe.columns.difference)
                                .str.len().eq(0), 'method'].iat[0]
    return dataframe.assign(metric=dataframe.apply(lambda x: func_to_apply(**x), axis=1))
print(metric(df1,column_name_ensembles),'\n')
print(metric(df2,column_name_ensembles),'\n')
print(metric(df3,column_name_ensembles))
width height metric
0 1 5 5
1 5 8 40
a b c metric
0 7 11 1 77
1 8 23 3 552
radius height metric
0 7 11 1693.31701
1 8 23 4624.42048
More details:
func_df = pd.DataFrame(column_name_ensembles).T
This creates a dataframe of column names and associated methods like below:
ensemble method
1 [height, width] <function area at 0x000002809540F9D8>
2 [a, b, c] <function volume_cube at 0x000002809540F950>
3 [radius, height] <function volume_cylinder at 0x000002809540FF28>
Using this dataframe, we find the row where the difference between the column names of the passed dataframe and the list of columns in ensemble is empty, using pd.Index.difference, series.map, series.str.len and series.eq():
func_df['ensemble'].map(df1.columns.difference)
1 Index([], dtype='object') <- Row matches the df columns completely
2 Index(['height', 'width'], dtype='object')
3 Index(['width'], dtype='object')
Name: ensemble, dtype: object
func_df['ensemble'].map(df1.columns.difference).str.len().eq(0)
1 True
2 False
3 False
Next, where the result is True, we pick the function in the method column:
func_df.loc[func_df['ensemble'].map(df1.columns.difference)
.str.len().eq(0),'method'].iat[0]
#<function __main__.area(width, height)>
and using apply and df.assign we create the new column, returning a copy of the passed dataframe.
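One subtlety of the matching step: dataframe.columns.difference(ensemble) is empty whenever the dataframe's columns are a subset of the ensemble, so a dataframe containing only height would also match the first row. A minimal sketch of a stricter lookup using plain set equality, assuming exact matches are wanted (the helper name find_method is hypothetical and not part of the answer above; it takes the column names as any iterable):

```python
def area(width, height):
    return width * height

def volume_cylinder(radius, height):
    return (3.14159 * radius ** 2) * height

column_name_ensembles = {
    "1": {"ensemble": ["height", "width"], "method": area},
    "3": {"ensemble": ["radius", "height"], "method": volume_cylinder},
}

def find_method(columns, column_name_ensembles):
    """Return the method whose ensemble equals the given column names exactly."""
    column_set = set(columns)
    for spec in column_name_ensembles.values():
        if column_set == set(spec["ensemble"]):
            return spec["method"]
    raise KeyError(f"No ensemble matches columns {sorted(column_set)}")

# With a dataframe this would be called as find_method(df1.columns, column_name_ensembles)
matched = find_method(["width", "height"], column_name_ensembles)
```

Here a partial set of columns such as ["height"] raises a KeyError instead of silently picking the first ensemble that contains it.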
Upvotes: 0
Reputation: 36289
You can use the inspect module to extract parameter names automatically and then map a frozenset of parameter names to metric functions directly:
import inspect

metrics = {
    frozenset(inspect.signature(f).parameters): f
    for f in (area, volume_cube, volume_cylinder)
}
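To see what this dictionary contains, here is a small self-contained sketch (the functions are copied from the question; param_names and lookup exist only for illustration). inspect.signature(f).parameters is an ordered mapping keyed by parameter name, so iterating over it yields the names in declaration order:

```python
import inspect

def area(width, height):
    return width * height

def volume_cube(a, b, c):
    return a * b * c

def volume_cylinder(radius, height):
    return (3.14159 * radius ** 2) * height

metrics = {
    frozenset(inspect.signature(f).parameters): f
    for f in (area, volume_cube, volume_cylinder)
}

# Parameter names come back in declaration order
param_names = list(inspect.signature(volume_cube).parameters)

# A frozenset ignores order, so any ordering of the column names finds the function
lookup = metrics[frozenset(["height", "width"])]
```

Because the keys are sets of names, two functions with identical parameter names would overwrite each other in the dictionary, so this approach assumes each metric has a distinct set of parameters.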
Then for a given data frame, if all columns are guaranteed to be arguments to the relevant metric, you can simply query that dictionary:
def apply_metric(df, metrics):
    metric = metrics[frozenset(df.columns)]
    args = tuple(df[p] for p in inspect.signature(metric).parameters)
    df['metric'] = metric(*args)
    return df
In case the input data frame has more columns than are required by the metric function you can use set intersection for finding the relevant metric:
def apply_metric(df, metrics):
    for parameters, metric in metrics.items():
        if parameters & set(df.columns) == parameters:
            args = tuple(df[p] for p in inspect.signature(metric).parameters)
            df['metric'] = metric(*args)
            break
    else:
        raise ValueError(f'No metric found for columns {df.columns}')
    return df
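A usage sketch for this second version, assuming a dataframe with an extra label column (made up for this example). The subset test is written here with <=, which is equivalent to the parameters & set(df.columns) == parameters check above:

```python
import inspect

import pandas as pd

def area(width, height):
    return width * height

def volume_cube(a, b, c):
    return a * b * c

metrics = {
    frozenset(inspect.signature(f).parameters): f
    for f in (area, volume_cube)
}

def apply_metric(df, metrics):
    columns = set(df.columns)
    for parameters, metric in metrics.items():
        if parameters <= columns:  # all required columns are present
            # pass the columns in the order the function declares its parameters
            args = tuple(df[p] for p in inspect.signature(metric).parameters)
            df['metric'] = metric(*args)
            break
    else:
        # the loop finished without a break, so nothing matched
        raise ValueError(f'No metric found for columns {list(df.columns)}')
    return df

# "label" is an illustrative extra column that no metric needs
df = pd.DataFrame({"width": [1, 5], "height": [5, 8], "label": ["x", "y"]})
result = apply_metric(df, metrics)
```

Note that this adds the column to the passed dataframe in place, and that the first metric whose parameters are all present wins if several would fit.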
Upvotes: 2
Reputation: 1549
The idea is to make the function as generic as possible. To do that you can rely on df.apply with axis=1 to apply the function row-wise.
The function would be:
def metric(df_in, ensembles):
    # To avoid modifying the original dataframe
    df = df_in.copy()
    for data in ensembles.values():
        if set(df.columns) == set(data["ensemble"]):
            df["metric"] = df.apply(lambda row: data["method"](**row), axis=1)
            break
    return df
This can be applied even to functions that won't work on whole columns.
For example:
df = pd.DataFrame({
    "a": [1, 2],
    "b": [[1, 2], [3, 4]],
})

def a_in_b(a, b):
    return a in b

# This will work
df.apply(lambda row: a_in_b(**row), axis=1)
# This won't
a_in_b(df["a"], df["b"])
Upvotes: 0
Reputation: 1506
def metric(df, column_name_ensembles):
    df_cols_set = set(df.columns)
    # if there is a need to overwrite the previously calculated 'metric' column
    df_cols_set.discard('metric')
    for column_name_ensemble in column_name_ensembles.items():
        # pick up the first `column_name_ensemble` dictionary
        # with 'ensemble' matching the df columns
        # (excluding 'metric' column, if present)
        # comparing `set` if order of column names
        # in ensemble does not matter (as per your df1 example),
        # else can compare `list`
        if df_cols_set == set(column_name_ensemble[1]['ensemble']):
            df['metric'] = column_name_ensemble[1]['method'](**{col: df[col] for col in df_cols_set})
            break
    # if there is a match, return df with 'metric' calculated
    # else, return original df untouched
    return df
Upvotes: 0
Reputation: 123
The function that runs the model should be a fairly flexible apply. Assuming the calculations will always be limited to the data in a single row, this would probably work.
First, I modified the functions to use a common input. I added a triangle area calc to be sure this was extensible.
#def area(width, height):
#    return width * height
def area(row):
    return row['width'] * row['height']

#def volume_cube(a, b, c):
#    return a * b * c
def volume_cube(row):
    return row['a'] * row['b'] * row['c']

#def volume_cylinder(radius, height):
#    return (3.14159 * radius ** 2) * height
def volume_cylinder(row):
    return (3.14159 * row['radius'] ** 2) * row['height']

def area_triangle(row):
    return 0.5 * row['width'] * row['height']
This allows us to use the same apply for all of the functions. Because I'm a bit ocd, I changed the names of keys in the reference dictionary.
column_name_ensembles = {
    "area": {
        "ensemble": ['width', 'height'],
        "method": area},
    "volume_cube": {
        "ensemble": ['a', 'b', 'c'],
        "method": volume_cube},
    "volume_cylinder": {
        "ensemble": ['radius', 'height'],
        "method": volume_cylinder},
    "area_triangle": {
        "ensemble": ['width', 'height'],
        "method": area_triangle},
}
The metric function then is an apply to the df. You have to specify the function you are targeting in this version, but you could infer the ensemble method based on the columns. This version makes sure the required columns are available.
def metric(df, method_id):
    source_columns = list(df.columns)
    calc_columns = column_name_ensembles[method_id]['ensemble']
    if all(factor in source_columns for factor in calc_columns):
        df['metric'] = df.apply(lambda row: column_name_ensembles[method_id]['method'](row), axis=1)
        return df
    else:
        print('Column Mismatch')
You can then specify the dataframe and the ensemble method.
df1_with_metric = metric(df1,'area')
df2_with_metric = metric(df2,'volume_cube')
df3_with_metric = metric(df3,'volume_cylinder')
df1_with_triangle_metric = metric(df1,'area_triangle')
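If the method should be inferred from the columns instead, note that with the dictionary above area and area_triangle share the same ensemble, so the columns alone can be ambiguous. A minimal sketch that lists every candidate (the helper candidate_methods is hypothetical, and the dictionary is reduced to the ensembles for brevity):

```python
# Reduced version of the reference dictionary: method id -> required columns
ensembles = {
    "area": ["width", "height"],
    "volume_cube": ["a", "b", "c"],
    "volume_cylinder": ["radius", "height"],
    "area_triangle": ["width", "height"],
}

def candidate_methods(columns, ensembles):
    """Return the ids of all methods whose required columns are available."""
    available = set(columns)
    return [method_id for method_id, required in ensembles.items()
            if set(required) <= available]

# With a dataframe this would be called as candidate_methods(df1.columns, ensembles)
candidates = candidate_methods(["width", "height"], ensembles)
```

For the width/height columns this yields both area and area_triangle, which is why an explicit method_id is still needed whenever more than one candidate comes back.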
Upvotes: 0