Sam Ezebunandu
Sam Ezebunandu

Reputation: 1

Using Pandas .apply() method with a regex-based function

I am trying to create a new column in a data Frame by applying a function on a column that has numbers as strings.

I have written the function to extract the numbers I want and tested it on a single string input and can confirm that it works.

SEARCH_PATTERN = r'([0-9]{1,2}) ([0-9]{2}):([0-9]{2}):([0-9]{2})'
def get_total_time_minutes(time_col, pattern=SEARCH_PATTERN):
    """Uses regex to parse time_col which is a string in the format 'd hh:mm:ss' to
    obtain a total time in minutes
    """
    days, hours, minutes, _ = re.match(pattern, time_col).groups()
    total_time_minutes = (int(days)*24 + int(hours))*60 + int(minutes)
    return total_time_minutes

#test that the function works for a single input
text = "2 23:24:46"
print(get_total_time_minutes(text))

Ouput: 4284

#apply the function to the required columns
df['Minutes Available'] = df['Resource available (d hh:mm:ss)'].apply(get_total_time_minutes)

The picture below is a screenshot of my dataframe columns. Screenshot of my dataframe The 'Resources available (d hh:mm:ss)' column of my dataframe is of Pandas type 'o' (string, if my understanding is correct), and has data in the following format: '5 08:00:00'. When I call the apply(get_total_time_minutes) on it though, I get the following error:

TypeError: expected string or bytes-like object

To clarify further, the "Resources Available" column is a string representing the total time in days, hours, minutes and seconds that the resource was available. I want to convert that time string to a total time in minutes, hence the regex and arithmetic within the get_total_time_minutes function. – Sam Ezebunandu just now

Upvotes: 0

Views: 88

Answers (1)

ifly6
ifly6

Reputation: 5331

This might be a bit hacky, because it uses the datetime library to parse the date and then turn it into a Timedelta by subtracting the default epoch:

>>> pd.to_datetime('2 23:48:30', format='%d %H:%M:%S') - pd.to_datetime('0', format='%S')
Out[47]: Timedelta('1 days 23:48:30')

>>> Out[47] / pd.Timedelta('1 minute')
Out[50]: 2868.5

But it does tell you how many minutes elapsed in those two days and however many hours. It's also vectorised, so you can apply it to the columns and get your minute values a lot faster than using apply.

Upvotes: 1

Related Questions