gowerc

Reputation: 1099

What is the Pythonic way of creating a pandas variable from multiple other variables?

I am an R programmer currently trying to learn Python / pandas. I am trying to work out how to clearly and cleanly create a new variable from a function that uses multiple existing variables.

Note that the function in my example isn't complex, but I am trying to generalise to an arbitrary function that could be significantly more complex or take more variables. In other words, I want to avoid solutions that are optimised for this specific function and am looking for a way to handle the general scenario.

For reference this is an example of how I would do this in R.

library(tidyverse)

df <- data_frame(
    num = c(15, 52 , 24 , 29),
    cls = c("a" , "b" , "b", "a")
)

attempt1 <- function( num , cls){
    if ( cls == "a") return( num + 10)
    if ( cls == "b") return( num - 10)
}

## Example 1
df %>% 
    mutate( num2 = map2_dbl( num , cls , attempt1))

## Example 2
df %>% 
    mutate( num = ifelse( num <= 25 , num + 10 , num)) %>% 
    mutate( num2 = map2_dbl( num , cls , attempt1))

From reading the pandas documentation and various SO posts, I have found multiple ways of achieving this in Python, but none of them sits well with me. For reference, I've posted my current three solutions below:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "num" : [14, 52 , 24 , 29],
    "cls" : ["a" , "b" , "b" ,"a"]
})

### Example 1

def attempt1( num, cls):
    if cls == "a":
        return num + 10
    if cls == "b":
        return num - 10

df.assign( num2 = df.apply( lambda x: attempt1(x["num"] , x["cls"]) , axis = 1))


def attempt2( df):
    if df["cls"] == "a":
        return df["num"] + 10
    if df["cls"] == "b":
        return df["num"] - 10

df.assign( num2 = df.apply(attempt2, axis=1))



def attempt3(df):
    df["num2"] = attempt1(df["num"], df["cls"])
    return df

df.apply( attempt3 , axis = 1)



### Example 2

df.assign( num = np.where( df["num"] <= 25 , df["num"] + 10 , df["num"]))\
    .apply( attempt3 , axis = 1)

My issue with attempt 1 is that it is very verbose. In addition, you need to refer back to your starting dataset by name, which means that if you wanted to chain multiple derivations together you would have to write your dataset out to intermediate variables even if you had no intention of keeping them.

Attempt 2 has significantly cleaner syntax but still suffers from the intermediate variable problem. Another issue is that the function expects a dataframe row, which makes it harder to unit test, less flexible and less clear about what the inputs should be.

Attempt 3 seems the best to me in terms of functionality, as it gives you a clear, testable function and doesn't require saving intermediate datasets. The major downside is that you now need two functions, which feels like redundant code.

Any help or advice would be greatly appreciated.

Upvotes: 3

Views: 170

Answers (2)

Guybrush

Reputation: 2790

You can rely on Series.where to do the job, by creating a column that contains 10, and changing it to -10 depending on the value of cls. You can then use that column to perform an arithmetic operation like you want.

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.where.html

Step by step (verbose) example:

df['what_to_add'] = 10
df['what_to_add'] = df['what_to_add'].where(df['cls'] == 'a', -10)
df['num'] = df['num'] + df['what_to_add']
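The same idea can also be written without modifying df in place, which may suit the chaining style from the question better. This is only a sketch: add_offset is a hypothetical helper name, and it relies on DataFrame.assign accepting a callable that receives the dataframe being assigned to.

```python
import pandas as pd

df = pd.DataFrame({
    "num": [14, 52, 24, 29],
    "cls": ["a", "b", "b", "a"],
})

def add_offset(frame):
    # Hypothetical helper: start from 10 everywhere, switch to -10 where cls is not "a"
    offset = pd.Series(10, index=frame.index).where(frame["cls"] == "a", -10)
    return frame["num"] + offset

# assign calls add_offset with the dataframe, so no named intermediate is needed
result = df.assign(num2=add_offset)
print(result)
```

Here df itself is left untouched and the new values land in a separate num2 column.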

Another possibility given that your two numbers are opposite is to define a column for the sign of the operand:

df['sign'] = 2 * (df['cls'] == 'a').astype(int) - 1  # +1 for 'a', -1 otherwise
df['num'] = df['num'] + df['sign'] * 10

A third way is to use replace, so that "a" is replaced with 10 and "b" with -10:

df['what_to_add'] = df['cls'].replace(['a', 'b'], [10, -10])
df['num'] = df['num'] + df['what_to_add']

edited: Or, as proposed by JPP (https://stackoverflow.com/a/49748695/4582949), using map:

df['num2'] = df['num'] + df['cls'].map({'a': 10, 'b': -10})

Upvotes: 1

jpp

Reputation: 164773

One efficient method is to use pd.Series.map:

df['num2'] = df['num'] + df['cls'].map({'a': 10, 'b': -10})

This uses a dictionary to map values of cls to either 10 or -10.

There are many other methods (see @Guybrush's answer), but the dictionary-based method is extendable and efficient for larger dataframes. In my opinion, it is also readable.
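As a rough illustration of how this could fit the chained "Example 2" from the question, the dictionary lookup can sit inside DataFrame.assign with a lambda, so no intermediate dataframe has to be named. The names offsets and result are mine, chosen for the sketch, not anything pandas requires.

```python
import pandas as pd

df = pd.DataFrame({
    "num": [14, 52, 24, 29],
    "cls": ["a", "b", "b", "a"],
})

# Assumed lookup table: one offset per class
offsets = {"a": 10, "b": -10}

result = (
    df
    # Step 1: add 10 to num where num <= 25 (keep values above 25 as they are)
    .assign(num=lambda d: d["num"].where(d["num"] > 25, d["num"] + 10))
    # Step 2: the lambda sees the updated num from step 1
    .assign(num2=lambda d: d["num"] + d["cls"].map(offsets))
)
print(result)
```

One thing to watch for: a class that is missing from the dictionary maps to NaN, so num2 would be NaN on those rows.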

Relevant: Replace values in a pandas series via dictionary efficiently

Upvotes: 1
