Esme_
Esme_

Reputation: 1520

Iterate over pandas dataframe while updating values

I've looked through a bunch of similar questions, but I cannot figure out how to actually apply the principles to my own case. I'm therefore trying to figure out a simple example I can work from - basically I need the idiots' guide before I can look at more complex examples

Consider a dataframe that contains a list of names and times, and a known start time. I then want to update the dataframe with the finish time, which is calculated from starttime + Time

import pandas as pd
import datetime

df = pd.DataFrame({"Name": ["Kate","Sarah","Isabell","Connie","Elsa","Anne","Lin"],
                  "Time":[3, 6,1, 7, 23,3,4]})
starttime = datetime.datetime.strptime('2020-02-04 00:00:00', '%Y-%m-%d %H:%M:%S')

I know that for each case I can calculate the finish time using

finishtime = starttine + datetime.datetime.timedelta(minutes = df.iloc[0,1])

what I can't figure out is how to use this while iterating over the df rows and updating a third column in the dataframe with the output.

I tried

df["FinishTime"] = np.nan

for row in df.itertuples():
    df.at[row,"FinishTime"] = starttine + datetime.datetime.timedelta(minutes = row.Time)

but it gave a lot of errors I couldn't unravel. How am I meant to do this?

I am aware that the advice to iterating over a dataframe is don't - I'm not committed to iterating, I just need some way to calculate that final column and add it to the dataframe. My real data is about 200k lines.

Upvotes: 1

Views: 98

Answers (1)

RichieV
RichieV

Reputation: 5183

Use pd.to_timedelta()

import datetime
starttime = datetime.datetime.strptime('2020-02-04 00:00:00', '%Y-%m-%d %H:%M:%S')
df = pd.DataFrame({"Name": ["Kate","Sarah","Isabell","Connie","Elsa","Anne","Lin"],
    "Time":[3, 6,1, 7, 23,3,4]})
df.Time = pd.to_timedelta(df.Time, unit='m')
# df = df.assign(FinishTime = df.Time + starttime)
df['FinishTime'] = df.Time + starttime # as pointed out by Trenton McKinney, .assign() is only one way to create new columns
# creating with df['new_col'] has the benefit of not having to copy the full df

print(df)

Output

      Name     Time          FinishTime
0     Kate 00:03:00 2020-02-04 00:03:00
1    Sarah 00:06:00 2020-02-04 00:06:00
2  Isabell 00:01:00 2020-02-04 00:01:00
3   Connie 00:07:00 2020-02-04 00:07:00
4     Elsa 00:23:00 2020-02-04 00:23:00
5     Anne 00:03:00 2020-02-04 00:03:00
6      Lin 00:04:00 2020-02-04 00:04:00

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_timedelta.html

Avoid looping in pandas at all cost

Maybe not at all cost, but pandas takes advantage of C implementations to improve performance by several orders of magnitude. There are many (many) common functions already implemented for our convenience.

Here is a great stackoverflow conversation about this very topic.

Upvotes: 2

Related Questions