Filip Szczybura

Reputation: 437

Wrapper function for pandas applymap

I am using sklearn.preprocessing.FunctionTransformer with some custom functions.

def enumerate_virus_scanned(virus_scanned: str) -> int:
    return 1 if not pd.isnull(virus_scanned) else 0

def enumerate_priority(priority: str) -> int:
    try:
        return int(re.search(r'\d+', priority).group(0))
    except (AttributeError, TypeError):
        return 0

def enumerate_encoding(encoding: str) -> int:
    content_transfer_encoding = {
        "na":  0,
        "base64": 1,
        "quoted-printable": 2,
        "8bit": 3,
        "7bit": 4,
        "binary": 5
    }
    try:
        return content_transfer_encoding[encoding.lower()]
    except (AttributeError, KeyError):
        return 0

As you may notice, these functions take a scalar as input, but FunctionTransformer passes the whole DataFrame to the function it wraps. Thus, I need to use the pd.DataFrame.applymap() method for each transformer.

virus_scanned_transformer, priority_transformer, encoding_transformer = (
    FunctionTransformer(lambda df: df.applymap(func)) for func in
    [enumerate_virus_scanned, enumerate_priority, enumerate_encoding]
)

However, this does not work. I do not want to rewrite the functions so that they call df.applymap() internally, like this:

def enumerate_virus_scanned(df: pd.DataFrame) -> pd.DataFrame:
    return df.applymap(lambda x: 1 if not pd.isnull(x) else 0)

Is it possible to write a decorator that automatically calls df.applymap() with the scalar function, so that the function itself can keep operating on a single value?

def transformer_wrapper(func):
    def wrap(*args, **kwargs):
        return df.applymap(func)
    return wrap

@transformer_wrapper
def enumerate_virus_scanned(virus_scanned: str) -> int:
    return 1 if not pd.isnull(virus_scanned) else 0

Maybe there is a better solution for that?

Upvotes: 0

Views: 62

Answers (1)

Cameron Riddell

Reputation: 13417

Your decorator is fairly close; the wrapper just needs to accept df as its first positional argument:

from functools import wraps
import pandas as pd
from numpy import nan

def applymap_wrap(func):
    @wraps(func)
    def wrapper(df, *args, **kwargs):
        return df.applymap(func, *args, **kwargs)

    return wrapper

@applymap_wrap
def enumerate_virus_scanned(virus_scanned: str) -> int:
    return 1 if not pd.isnull(virus_scanned) else 0

# ---

df = pd.DataFrame({
    "x": [ 10, nan, 20, 30, nan],
    "y": [nan, nan,  1,  2,   3]
})

print(enumerate_virus_scanned(df))
   x  y
0  1  0
1  0  0
2  1  1
3  1  1
4  0  1
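
As an aside, the generator of lambdas in the question probably fails because of late binding: each lambda looks up func only when it is called, by which point the generator has finished, so all three transformers end up applying enumerate_encoding. Below is a minimal sketch of the effect and one way around it with functools.partial. The names apply_elementwise, is_present and text_length are hypothetical and used only for illustration, and the fallback assumes DataFrame.map is available on pandas 2.1 or later while older versions only offer applymap.

```python
from functools import partial

import pandas as pd


def apply_elementwise(df: pd.DataFrame, func) -> pd.DataFrame:
    """Hypothetical helper: apply a scalar function to every cell of df."""
    # Assumption: DataFrame.map exists on pandas >= 2.1; older versions use applymap.
    method = df.map if hasattr(pd.DataFrame, "map") else df.applymap
    return method(func)


def is_present(value) -> int:
    return 0 if pd.isnull(value) else 1


def text_length(value) -> int:
    return len(value) if isinstance(value, str) else 0


# Late binding: both lambdas look up `func` when called, after the loop has ended,
# so both of them end up using text_length.
late = [lambda df: apply_elementwise(df, func) for func in (is_present, text_length)]

# partial stores each function at creation time, so each callable keeps its own.
bound = [partial(apply_elementwise, func=func) for func in (is_present, text_length)]

df = pd.DataFrame({"a": ["abc", None]})
late_results = [f(df)["a"].tolist() for f in late]
bound_results = [f(df)["a"].tolist() for f in bound]
```

Each partial object, like the decorated function above, is a plain DataFrame-in, DataFrame-out callable, so it should be usable as the func argument of FunctionTransformer in place of the lambda.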

That said, why not use DataFrame-level methods? Vectorized methods such as DataFrame.isnull() and DataFrame.notnull() are much faster than DataFrame.applymap(lambda x: …), because they avoid calling a Python function once per cell.

print(df.notnull().astype(int))
   x  y
0  1  0
1  0  0
2  1  1
3  1  1
4  0  1
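
The same idea can probably be extended to the other two functions from the question. This is only a sketch, assuming each column holds strings or missing values; priority_column and encoding_column are hypothetical names, and the sample data is made up for illustration.

```python
import pandas as pd

# Mapping copied from the question's enumerate_encoding.
CONTENT_TRANSFER_ENCODING = {
    "na": 0,
    "base64": 1,
    "quoted-printable": 2,
    "8bit": 3,
    "7bit": 4,
    "binary": 5,
}


def priority_column(col: pd.Series) -> pd.Series:
    # Take the first run of digits per cell; cells without digits or missing become 0.
    digits = col.str.extract(r"(\d+)", expand=False)
    return digits.fillna("0").astype(int)


def encoding_column(col: pd.Series) -> pd.Series:
    # Lower-case, look up in the mapping, and send unknown or missing values to 0.
    return col.str.lower().map(CONTENT_TRANSFER_ENCODING).fillna(0).astype(int)


df = pd.DataFrame({
    "priority": ["3 (Normal)", None, "High"],
    "encoding": ["Base64", "7BIT", None],
})

# DataFrame.apply calls each function once per column, not once per cell.
priority = df[["priority"]].apply(priority_column)
encoding = df[["encoding"]].apply(encoding_column)
```

Unlike the scalar versions, these would raise on a column that contains no strings at all (for example an all-numeric column), so the try/except behaviour of the originals is not fully reproduced.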

Upvotes: 1
