cool_beans
cool_beans

Reputation: 141

Getting the average from a dataframe consisting of pandas timestamp

I have two panda series, closedDate and createdDate, thats elements are pandas timestamps, class 'pandas._libs.tslib.Timestampclass 'pandas._libs.tslib.Timestamp.

I have subtracted those two panda series to make a list, age, of pandas timedelta.

closedDate = data.iloc[:,1]
createdDate = data.iloc[:,2]
age = [x-y for x,y in zip(closedDate, createdDate)]

Now, I want to get the average of age but with my line of code I get an error.

In: average_age = functools.reduce(lambda x, y: x + y, age) / len(age)

Out: OverflowError: int too big to convert

How can I fix this??

Thanks!

Upvotes: 1

Views: 338

Answers (2)

M Sandler
M Sandler

Reputation: 11

You should do this with series like jpp suggested, but barring that, heres how you could do it with the list you provided.

average_age = sum(age, timedelta()) / len(age)

Upvotes: 0

jpp
jpp

Reputation: 164623

You can and should aim to use vectorised functions for this task.

In this example, you can subtract one pd.Series from another. You can then use the mean method to calculate the average.

data = pd.DataFrame({'createdDate': [pd.Timestamp('2018-01-01'),
                                     pd.Timestamp('2018-05-01')],
                     'closedDate': [pd.Timestamp('2018-02-01'),
                                    pd.Timestamp('2018-07-01')]})

closedDate = data['closedDate']
createdDate = data['createdDate']

ages = closedDate - createdDate

print(ages)

# 0   31 days
# 1   61 days
# dtype: timedelta64[ns]

res = ages.mean()

print(res)

# 46 days 00:00:00

There are 2 main reasons why using vectorised functions are better in this case:

  1. Underlying pd.Timestamp are numeric arrays (source).
  2. zip on arrays underperforms versus zip on lists (source).

Upvotes: 2

Related Questions