Reputation: 977

Pandas flexible determination of metrics

Imagine we have different structures of dataframes in Pandas

import pandas as pd

# creating the first dataframe
df1 = pd.DataFrame({
  "width": [1, 5], 
  "height": [5, 8]})

# creating second dataframe
df2 = pd.DataFrame({
  "a": [7, 8], 
  "b": [11, 23],
  "c": [1, 3]})

# creating the third dataframe
df3 = pd.DataFrame({
  "radius": [7, 8], 
  "height": [11, 23]})

In general, there may be many more dataframes than these three. I want to write logic that maps sets of column names to specific functions in order to create a new column "metric" (think of it as the area for two columns and the volume for three columns). I want to specify the column-name ensembles together with the functions they map to:

# the functions are defined first so that the dictionary below can reference them
def area(width, height):
    return width * height

def volume_cube(a, b, c):
    return a * b * c

def volume_cylinder(radius, height):
    return (3.14159 * radius ** 2) * height

column_name_ensembles = {
    "1": {
       "ensemble": ['height', 'width'],
       "method": area},
    "2": {
       "ensemble": ['a', 'b', 'c'],
       "method": volume_cube},
    "3": {
       "ensemble": ['radius', 'height'],
       "method": volume_cylinder}}

Now, the area function should create a new column for the dataframe, df1['metric'] = df1['height'] * df1['width'], and the volume function should create a new column for the dataframe, df2['metric'] = df2['a'] * df2['b'] * df2['c']. Note that the functions can have arbitrary form, but each one takes its ensemble as parameters. The desired function metric(df, column_name_ensembles) should take an arbitrary dataframe as input and decide, by inspecting the column names, which function should be applied.

Example input output behaviour

df1_with_metric = metric(df1, column_name_ensembles)
print(df1_with_metric)
# output
#    width height metric
#  0 1     5      5 
#  1 5     8      40
df2_with_metric = metric(df2, column_name_ensembles)
print(df2_with_metric)
# output
#    a  b  c  metric
#  0 7  11 1  77
#  1 8  23 3  552
df3_with_metric = metric(df3, column_name_ensembles)
print(df3_with_metric)
# output
#    radius  height  metric
#  0 7       11      1693.31701
#  1 8       23      4624.42048

The perfect solution would be a function that takes the dataframe and the column_name_ensembles as parameters and returns the dataframe with the appropriate 'metric' added to it.

I know this can be achieved by multiple if and else statements, but this does not seem to be the most intelligent solution. Maybe there is a design pattern that can solve this problem, but I am not an expert at design patterns.

Thank you for reading my question! I am looking forward to your answers.

Upvotes: 3

Answers (5)

anky

Reputation: 75090

Here is an interesting way of doing this using pandas methods (Details below)

def metric(dataframe,column_name_ensembles):
    func_df = pd.DataFrame(column_name_ensembles).T
    func_to_apply = func_df.loc[func_df['ensemble'].map(dataframe.columns.difference)
                        .str.len().eq(0),'method'].iat[0]
    return dataframe.assign(metric=dataframe.apply(lambda x: func_to_apply(**x),axis=1))

print(metric(df1,column_name_ensembles),'\n')
print(metric(df2,column_name_ensembles),'\n')
print(metric(df3,column_name_ensembles))

   width  height  metric
0      1       5       5
1      5       8      40 

   a   b  c  metric
0  7  11  1      77
1  8  23  3     552 

   radius  height      metric
0       7      11  1693.31701
1       8      23  4624.42048

More details:

func_df = pd.DataFrame(column_name_ensembles).T

This creates a dataframe of column names and associated methods like below:

          ensemble                                            method
1   [height, width]             <function area at 0x000002809540F9D8>
2         [a, b, c]      <function volume_cube at 0x000002809540F950>
3  [radius, height]  <function volume_cylinder at 0x000002809540FF28>

Using this dataframe, we find the row where the difference between the column names of the passed dataframe and the list of columns in the ensemble is empty, using pd.Index.difference, Series.map, Series.str.len and Series.eq():

func_df['ensemble'].map(df1.columns.difference)

1                     Index([], dtype='object') <- Row matches the df columns completely
2    Index(['height', 'width'], dtype='object')
3              Index(['width'], dtype='object')
Name: ensemble, dtype: object

func_df['ensemble'].map(df1.columns.difference).str.len().eq(0)
1     True
2    False
3    False

Next, where the mask is True, we pick the function in the method column:

func_df.loc[func_df['ensemble'].map(df1.columns.difference)
                            .str.len().eq(0),'method'].iat[0]
#<function __main__.area(width, height)>

Finally, using apply and df.assign, we create the new column on a copy of the passed dataframe and return that copy.
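One caveat worth checking: Index.difference is one-directional, so a dataframe whose columns are only a subset of an ensemble (for example just "a") also produces an empty difference and would select a function it cannot call. Below is a minimal sketch of a stricter variant that compares the two as sets. The name metric_exact and the ValueError are my own assumptions for illustration, not part of the answer above.

```python
import pandas as pd

def area(width, height):
    return width * height

def volume_cube(a, b, c):
    return a * b * c

column_name_ensembles = {
    "1": {"ensemble": ["height", "width"], "method": area},
    "2": {"ensemble": ["a", "b", "c"], "method": volume_cube},
}

def metric_exact(dataframe, column_name_ensembles):
    # hypothetical stricter variant: the ensemble must equal the column set
    func_df = pd.DataFrame(column_name_ensembles).T
    columns = set(dataframe.columns)
    # True only where ensemble and dataframe columns are the same set
    is_match = func_df["ensemble"].map(
        lambda ensemble: set(ensemble) == columns
    ).astype(bool)
    if not is_match.any():
        raise ValueError(f"No ensemble matches columns {list(dataframe.columns)}")
    func_to_apply = func_df.loc[is_match, "method"].iat[0]
    # assign returns a copy, so the input dataframe is left untouched
    return dataframe.assign(
        metric=dataframe.apply(lambda row: func_to_apply(**row), axis=1)
    )
```

With this variant, a dataframe that has only column "a" raises the ValueError up front instead of failing later inside volume_cube with a missing-argument error.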

Upvotes: 0

a_guest

Reputation: 36289

You can use the inspect module to extract the parameter names automatically and then map a frozenset of parameter names directly to each metric function:

import inspect

metrics = {
    frozenset(inspect.signature(f).parameters): f
    for f in (area, volume_cube, volume_cylinder)
}

Then for a given data frame, if all columns are guaranteed to be arguments to the relevant metric, you can simply query that dictionary:

def apply_metric(df, metrics):
    metric = metrics[frozenset(df.columns)]
    args = tuple(df[p] for p in inspect.signature(metric).parameters)
    df['metric'] = metric(*args)
    return df

In case the input data frame has more columns than are required by the metric function you can use set intersection for finding the relevant metric:

def apply_metric(df, metrics):
    for parameters, metric in metrics.items():
        if parameters & set(df.columns) == parameters:
            args = tuple(df[p] for p in inspect.signature(metric).parameters)
            df['metric'] = metric(*args)
            break
    else:
        raise ValueError(f'No metric found for columns {df.columns}')
    return df
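For a quick check of the subset-matching idea, here is a self-contained sketch that runs it on a dataframe with a hypothetical extra "label" column. It assumes the three metric functions from the question. Returning a copy via df.assign instead of mutating the input is my own choice here, not something the answer prescribes.

```python
import inspect
import pandas as pd

def area(width, height):
    return width * height

def volume_cube(a, b, c):
    return a * b * c

def volume_cylinder(radius, height):
    return (3.14159 * radius ** 2) * height

# frozenset of parameter names -> metric function
metrics = {
    frozenset(inspect.signature(f).parameters): f
    for f in (area, volume_cube, volume_cylinder)
}

def apply_metric(df, metrics):
    columns = set(df.columns)
    for parameters, metric in metrics.items():
        # subset test: every parameter must be available as a column
        if parameters <= columns:
            args = tuple(df[p] for p in inspect.signature(metric).parameters)
            # returns a copy; the input dataframe is not modified
            return df.assign(metric=metric(*args))
    raise ValueError(f"No metric found for columns {list(df.columns)}")

# "label" is a made-up extra column that no metric uses
df = pd.DataFrame({"width": [1, 5], "height": [5, 8], "label": ["x", "y"]})
print(apply_metric(df, metrics))
```

Note that with subset matching the first metric that fits wins, so a dataframe containing "width", "height" and "radius" would be matched by whichever of area or volume_cylinder comes first in the dictionary.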

Upvotes: 2

villoro

Reputation: 1549

Solution

The idea is to make the function as generic as possible. To do that, rely on df.apply with axis=1 to apply the function row-wise.

The function would be:

def method(df_in, ensembles):

    # To avoid modifying the original dataframe
    df = df_in.copy()

    for data in ensembles.values():
        if set(df.columns) == set(data["ensemble"]):
            df["method"] = df.apply(lambda row: data["method"](**row), axis=1)
            return df

Why does it always work?

This approach works even for functions that cannot operate on whole columns at once.

For example:

df = pd.DataFrame({
    "a": [1, 2], 
    "b": [[1, 2], [3, 4]],
})

def a_in_b(a, b):
    return a in b

# This will work
df.apply(lambda row: a_in_b(**row), axis=1)

# This won't
a_in_b(df["a"], df["b"])
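For purely arithmetic functions such as area from the question, the row-wise and the whole-column call styles should give the same numbers. The sketch below compares the two on df1. The variable names row_wise and column_wise are only illustrative.

```python
import pandas as pd

def area(width, height):
    return width * height

df = pd.DataFrame({"width": [1, 5], "height": [5, 8]})

# row-wise: the function is called once per row with scalar arguments
row_wise = df.apply(lambda row: area(**row), axis=1)

# column-wise: the function is called once with whole Series as arguments
column_wise = area(**{name: df[name] for name in df.columns})

print(row_wise.tolist(), column_wise.tolist())
```

The row-wise form trades speed for generality, since it calls the Python function once per row, so for large dataframes and arithmetic-only metrics the column-wise call is usually the faster option.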

Upvotes: 0

chaooder

Reputation: 1506

def metric(df, column_name_ensembles):

    df_cols_set = set(df.columns)
    # if there is a need to overwrite the previously calculated 'metric' column
    df_cols_set.discard('metric')

    for column_name_ensemble in column_name_ensembles.items():

        # pick up the first `column_name_ensemble` dictionary 
        # with 'ensemble' matching the df columns 
        # (excluding 'metric' column, if present)
        # comparing `set` if order of column names 
        # in ensemble does not matter (as per your df1 example), 
        # else can compare `list`
        if df_cols_set == set(column_name_ensemble[1]['ensemble']):
            df['metric'] = column_name_ensemble[1]['method'](**{col: df[col] for col in df_cols_set})
            break

    # if there is a match, return df with 'metric' calculated
    # else, return original df untouched
    return df

Upvotes: 0

davidkunio

Reputation: 123

The function that runs the model should be a fairly flexible apply. Assuming the calculations will always be limited to the data in a single row, this would probably work.

First, I modified the functions to use a common input. I added a triangle area calc to be sure this was extensible.

#def area(width, height):
#    return width * height

def area(row):
    return row['width'] * row['height']

#def volume_cube(a, b, c):
#    return a * b * c

def volume_cube(row):
    return row['a'] * row['b'] * row['c']

#def volume_cylinder(radius, height):
#    return (3.14159 * radius ** 2) * height

def volume_cylinder(row):
    return (3.14159 * row['radius'] ** 2) * row['height']

def area_triangle(row):
    return 0.5 * row['width'] * row['height']

This allows us to use the same apply for all of the functions. Because I'm a bit ocd, I changed the names of keys in the reference dictionary.

column_name_ensembles = {
    "area": {
       "ensemble": ['width', 'height'],
       "method": area},
    "volume_cube": {
       "ensemble": ['a', 'b', 'c'],
       "method": volume_cube},
    "volume_cylinder": {
       "ensemble": ['radius', 'height'],
       "method": volume_cylinder},
    "area_triangle": {
       "ensemble": ['width', 'height'],
       "method": area_triangle},
    }

The metric function then is an apply to the df. You have to specify the function you are targeting in this version, but you could infer the ensemble method based on the columns. This version makes sure the required columns are available.

def metric(df,method_id):
    source_columns = list(df.columns)
    calc_columns = column_name_ensembles[method_id]['ensemble']
    if all(factor in source_columns for factor in calc_columns):
        df['metric'] = df.apply(lambda row: column_name_ensembles[method_id]['method'](row),axis=1)
        return df
    else:
        print('Column Mismatch')

You can then specify the dataframe and the ensemble method.

df1_with_metric = metric(df1,'area')
df2_with_metric = metric(df2,'volume_cube')
df3_with_metric = metric(df3,'volume_cylinder')
df1_with_triangle_metric = metric(df1,'area_triangle')
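If you would rather infer the method from the columns, a rough sketch could look like the following. The helper names candidate_methods and metric_inferred are made up for this example, and the dictionary is a trimmed copy of the one above. Because "area" and "area_triangle" share the same ensemble, the inference is ambiguous for df1, so this sketch raises an error in that case rather than guessing.

```python
import pandas as pd

def area(row):
    return row['width'] * row['height']

def area_triangle(row):
    return 0.5 * row['width'] * row['height']

def volume_cube(row):
    return row['a'] * row['b'] * row['c']

column_name_ensembles = {
    "area": {"ensemble": ['width', 'height'], "method": area},
    "volume_cube": {"ensemble": ['a', 'b', 'c'], "method": volume_cube},
    "area_triangle": {"ensemble": ['width', 'height'], "method": area_triangle},
}

def candidate_methods(df, ensembles):
    # every method id whose required columns are all present in df
    columns = set(df.columns)
    return [
        method_id
        for method_id, spec in ensembles.items()
        if set(spec["ensemble"]) <= columns
    ]

def metric_inferred(df, ensembles):
    candidates = candidate_methods(df, ensembles)
    if len(candidates) != 1:
        # zero matches or several matches: refuse to guess
        raise ValueError(f"Expected exactly one matching method, got {candidates}")
    method = ensembles[candidates[0]]["method"]
    # assign returns a copy with the new 'metric' column
    return df.assign(metric=df.apply(method, axis=1))

df2 = pd.DataFrame({"a": [7, 8], "b": [11, 23], "c": [1, 3]})
print(metric_inferred(df2, column_name_ensembles))
```

For ambiguous cases like df1 you would still pass the method id explicitly, as in the calls above.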

Upvotes: 0
