Reputation: 52323
My question is about performance only, not semantics.
Does adding a new column to a df cause the data in the existing DataFrame to be physically copied to a new memory location (to ensure that the DataFrame occupies contiguous memory, for example)?
# using pandas 0.18.1, python 3.5
import pandas as pd
df = pd.DataFrame({'a': range(100)})
b = pd.Series(range(100))
df['b'] = b # is this operation expensive?
# equivalently df.loc[:, 'b'] = b
I know (from experimentation, couldn't find it in the documentation) that df['b'] = b
will semantically create a copy of b
, which obviously requires copying of underlying data. But I have no idea if the data in the other columns can stay where it was, or need to be moved sometimes.
Edit:
I know that adding a large number of columns is expensive. I'm only asking about adding a single column.
I also know that adding a row requires copying of the data in some cases (or always? -- not sure) for an obvious reason that the items in a single column have to be in contiguous memory.
Upvotes: 4
Views: 1086
Reputation: 863166
I think from my experiments that loc
is slowier and align new Series
with different index the slowiest:
But I have no idea if the data in the other columns can stay where it was, or need to be moved sometimes.
I think data are not moved, new columns are added to the end (maybe some exception can be here, but I dont know about it).
# using pandas 0.18.1, python 3.5
import pandas as pd
#len(df) = 10m
df = pd.DataFrame({'a': range(10000000)})
b = pd.Series(range(10000000))
c = pd.Series(range(10000000), index=df.index)
df['b'] = b
df.loc[:, 'c'] = b
df['d'] = c
df.loc[:, 'e'] = c
print (df)
In [36]: %timeit df['b'] = b
10 loops, best of 3: 23.5 ms per loop
In [37]: %timeit df.loc[:, 'c'] = b
The slowest run took 5.76 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 40 ms per loop
In [38]: %timeit df['d'] = c
10 loops, best of 3: 22.3 ms per loop
In [39]: %timeit df.loc[:, 'e'] = c
10 loops, best of 3: 39.5 ms per loop
But if change index
:
# using pandas 0.18.1, python 3.5
import pandas as pd
df = pd.DataFrame({'a': range(10000000)})
df.index = df.index + 15
b = pd.Series(range(10000000))
c = pd.Series(range(10000000), index=df.index)
df['b'] = b
df.loc[:, 'c'] = b
df['d'] = c
df.loc[:, 'e'] = c
print (df)
In [41]: %timeit df['b'] = b
1 loop, best of 3: 656 ms per loop
In [42]: %timeit df.loc[:, 'c'] = b
1 loop, best of 3: 735 ms per loop
In [43]: %timeit df['d'] = c
10 loops, best of 3: 22.4 ms per loop
In [44]: %timeit df.loc[:, 'e'] = c
10 loops, best of 3: 56.6 ms per loop
If add new row, it is fast, I think it depends of length of Series
:
In [68]: %timeit df.loc[10000015, :] = pd.Series([1,2,3,2,4], index=df.columns)
1000 loops, best of 3: 274 µs per loop
But if add many rows, it is expensive and I think this can be avoided.
Upvotes: 2