SirAchesis

Reputation: 345

Is there a more efficient way to apply these functions to a DataFrame?

I need to apply a function to two columns in a dataframe.
The function should split the value in each row of the column and then convert the resulting pieces to ints.

There are two types of values:

  1. Dates as strings (e.g., "20.11.2020")
  2. Lists of numbers as strings (e.g., "20,11,49,19,2")

I currently do this as follows:

def numerize_c(row):
    """
    Delimiter is a comma
    """
    return [int(num) for num in row.split(",")]

def numerize_d(row):
    """
    Delimiter is a dot
    """
    return [int(num) for num in row.split(".")]

data["corr_num"] = data["corr_num"].apply(numerize_c)
data["game_date"] = data["game_date"].apply(numerize_d)

This feels terribly inefficient. Is there a way to, for example, pass the delimiter to a single function as an argument?

Or is there a way to write this as a lambda?

Upvotes: 0

Views: 46

Answers (2)

SCKU

Reputation: 833

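You could combine pd.DataFrame.apply with pd.Series.str.split and a regular expression that matches either ',' or '.', which splits both columns in one step. Note that this leaves the pieces as strings; it does not convert them to ints.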

data.loc[:, ["corr_num", "game_date"]] =\
     data[["corr_num", "game_date"]].apply(lambda x: x.str.split(r',|\.'))
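
If the ints are needed as well, the regex split can be followed by a conversion step. This is only a rough sketch: the sample DataFrame is made up for illustration, and the character class `[,.]` is my own equivalent of the pattern above, not something from the question.

```python
import pandas as pd

# Hypothetical sample data in the two formats described in the question.
data = pd.DataFrame({
    "corr_num": ["20,11,49,19,2", "1,2,3"],
    "game_date": ["20.11.2020", "01.02.2021"],
})

for col in ["corr_num", "game_date"]:
    # Split on either a comma or a dot (multi-character patterns are
    # treated as regular expressions), then convert each piece to int.
    parts = data[col].str.split(r"[,.]")
    data[col] = parts.apply(lambda pieces: [int(p) for p in pieces])
```

The int conversion still runs as a Python-level loop per row, so this is mainly a tidier way to write it rather than a guaranteed speed-up.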

Upvotes: 1

Matthew Hamilton

Reputation: 106

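An improvement would be to use the built-in data['corr_num'].str.split(','). It is usually faster than calling apply with a Python function, but it returns lists of strings, so converting them to ints is still a separate step.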
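
On the question of passing the delimiter as an argument: Series.apply forwards extra keyword arguments to the function, so one helper can cover both columns. A minimal sketch, where `numerize` and its `delim` parameter are names I made up and the sample data is invented:

```python
import pandas as pd


def numerize(value, delim):
    """Split a string on delim and convert each piece to an int."""
    return [int(num) for num in value.split(delim)]


# Hypothetical sample data in the two formats described in the question.
data = pd.DataFrame({
    "corr_num": ["20,11,49,19,2", "1,2,3"],
    "game_date": ["20.11.2020", "01.02.2021"],
})

# Keyword arguments given to Series.apply are passed on to numerize.
data["corr_num"] = data["corr_num"].apply(numerize, delim=",")
data["game_date"] = data["game_date"].apply(numerize, delim=".")
```

This does the same per-row work as the two original functions, so it should not be expected to run faster; it only removes the duplication.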

Upvotes: 1
