László

Reputation: 4144

Efficient appending to pandas DataFrames

I see that dataframes have a .pop method, but .append returns a new object (unlike list.append, which modifies the list in place). Constantly reallocating memory for the dataframe as I add rows can be inefficient (also see this answer testing preallocating space in reply to a similar question). However, I need to duplicate (and then modify) some rows, as I outlined in another question (example repeated below). Is it efficient to do this by appending rows to the end of the dataframe, or is there a better way?

I want to get from this (focus on id 2):

id                    start                     end
 1      2011-01-01 10:00:00     2011-01-08 16:03:00
 2      2011-01-28 03:45:00     2011-02-04 15:22:00
 3      2011-03-02 11:04:00     2011-03-05 05:24:00

To this:

id                    start                     end     month      stay
 1      2011-01-01 10:00:00     2011-01-08 16:03:00   2011-01         7
 2      2011-01-28 03:45:00     2011-01-31 23:59:59   2011-01         4
 2      2011-02-01 00:00:00     2011-02-04 15:22:00   2011-02         4
 3      2011-03-02 11:04:00     2011-03-05 05:24:00   2011-03         3

Upvotes: 3

Views: 4490

Answers (2)

crypdick

Reputation: 19834

Not sure if this is the best solution, but I would make a separate dataframe.

New DF:

id                    start                     end          stay
 1      NaT                     NaT                          NaN
 1      NaT                     NaT                          NaN
 1      NaT                     NaT                          NaN
 2      NaT                     NaT                          NaN
 2      NaT                     NaT                          NaN
 2      NaT                     NaT                          NaN

Step 1 of the algorithm simply inserts the rows whose dates don't span more than one month:

id                    start                     end          stay
 1      2011-01-01 10:00:00     2011-01-08 16:03:00          NaN
 1      NaT                     NaT                          NaN
 1      NaT                     NaT                          NaN
 2      NaT                     NaT                          NaN
 2      NaT                     NaT                          NaN
 2      NaT                     NaT                          NaN

Step 2 of the algorithm splits the remaining rows at the end of each month and inserts the pieces. Step 3 calculates the stay:

id                    start                     end          stay
 1      2011-01-01 10:00:00     2011-01-08 16:03:00          7
 1      NaT                     NaT                          NaN
 1      NaT                     NaT                          NaN
 2      2011-01-28 03:45:00     2011-01-31 23:59:59          4
 2      2011-02-01 00:00:00     2011-02-04 15:22:00          4
 2      NaT                     NaT                          NaN

Then select the rows without NaT/NaN values and save them as the final dataframe.
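A minimal sketch of the preallocation idea above, using the data from the question. Several things here are assumptions rather than part of the answer: the `SLOTS` constant (an upper bound on how many calendar months one stay can touch), filling NumPy arrays and wrapping them in a DataFrame once instead of writing into a DataFrame cell by cell, and rounding `stay` to the nearest whole day (inferred from the example output).

```python
import numpy as np
import pandas as pd

# Source data from the question.
src = pd.DataFrame({
    "id": [1, 2, 3],
    "start": pd.to_datetime(["2011-01-01 10:00:00", "2011-01-28 03:45:00",
                             "2011-03-02 11:04:00"]),
    "end": pd.to_datetime(["2011-01-08 16:03:00", "2011-02-04 15:22:00",
                           "2011-03-05 05:24:00"]),
})

SLOTS = 3  # assumption: no stay touches more than 3 calendar months

# Preallocate SLOTS placeholder rows per id (NaT for the dates).
n = len(src) * SLOTS
ids = np.repeat(src["id"].to_numpy(), SLOTS)
starts = np.full(n, np.datetime64("NaT", "ns"))
ends = np.full(n, np.datetime64("NaT", "ns"))

# Steps 1 and 2: write each stay into its slots, splitting at month ends.
for i, row in enumerate(src.itertuples(index=False)):
    k = i * SLOTS
    cur = row.start
    while cur.to_period("M") < row.end.to_period("M"):
        next_month = (cur.to_period("M") + 1).start_time
        starts[k] = cur.to_datetime64()
        ends[k] = (next_month - pd.Timedelta(seconds=1)).to_datetime64()
        cur = next_month
        k += 1
    starts[k] = cur.to_datetime64()
    ends[k] = row.end.to_datetime64()

# Build the DataFrame once, then drop the unused placeholder rows.
out = pd.DataFrame({"id": ids, "start": starts, "end": ends})
out = out.dropna(subset=["start", "end"]).reset_index(drop=True)

# Step 3: month label and stay (rounded to whole days, an assumption).
out["month"] = out["start"].dt.strftime("%Y-%m")
out["stay"] = ((out["end"] - out["start"]) / pd.Timedelta(days=1)).round().astype(int)
print(out)
```

If a stay could touch more months than `SLOTS` allows, the loop would spill into the next id's slots, so the bound has to be chosen with the real data in mind.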

Upvotes: 0

Chad Kennedy

Reputation: 1736

What you definitely don't want to do is insert one row at a time. You'll end up making a full copy of the dataframe with each insertion. If, for any given row, you will append at most one extra row, you could do the following steps:

1) load the dataframe from your source

2) append an uninitialized dataframe to the end of your original dataframe, with the same length

3) starting at the end of the original dataframe (now the middle), copy rows to a new location such that there is an extra row between each pair of original rows (index 10 -> index 20, index 9 -> index 18, etc.)

4) zero out the rows at all odd indices

5) run your algorithm to fill in blank rows with your data as necessary

6) at the end, remove all blank (all 0's) rows

This costs roughly the equivalent of 4 full copies of the dataframe, which is much better than one copy per insert.
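The steps above can be sketched roughly as follows, with one substitution that is not in the answer: rather than appending a blank block and shifting rows by hand, this version relabels the original rows with even index values and calls `reindex`, which creates the blank odd rows (NaN/NaT instead of zeros) in a single allocation. It assumes, as the answer does, that each row needs at most one extra row, meaning no stay crosses more than one month boundary.

```python
import pandas as pd

# Step 1: the source data from the question.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "start": pd.to_datetime(["2011-01-01 10:00:00", "2011-01-28 03:45:00",
                             "2011-03-02 11:04:00"]),
    "end": pd.to_datetime(["2011-01-08 16:03:00", "2011-02-04 15:22:00",
                           "2011-03-05 05:24:00"]),
})
n = len(df)

# Steps 2-4: original rows land on even labels, blank rows on odd labels.
df.index = range(0, 2 * n, 2)
wide = df.reindex(range(2 * n))

# Step 5: fill the blank row after every stay that crosses a month boundary.
spans = df["start"].dt.to_period("M") < df["end"].dt.to_period("M")
even = df.index[spans.to_numpy()]
odd = even + 1
next_month = (df.loc[even, "start"].dt.to_period("M") + 1).dt.start_time

wide.loc[odd, "id"] = df.loc[even, "id"].to_numpy()
wide.loc[odd, "start"] = next_month.to_numpy()
wide.loc[odd, "end"] = df.loc[even, "end"].to_numpy()
wide.loc[even, "end"] = (next_month - pd.Timedelta(seconds=1)).to_numpy()

# Step 6: remove the rows that stayed blank.
result = wide.dropna(subset=["start"]).reset_index(drop=True)
result["id"] = result["id"].astype(int)  # reindex turned id into float
print(result)
```

All the filling happens with a handful of vectorized assignments, so the number of full-size allocations stays fixed no matter how many rows get split.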

Upvotes: 3
