Reputation: 1366
Does anyone know why a pandas object's copy() method seems much slower than reconstructing the object? Is there any reason to use copy() over the standard constructor?
Here is a quick result:
In [41]: import numpy as np
In [42]: import pandas as pd
In [43]: df = pd.DataFrame(np.random.rand(300000).reshape(100000,3), columns=list('ABC'))
In [44]: %timeit pd.DataFrame(df)
The slowest run took 5.61 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 3.95 µs per loop
In [45]: %timeit df.copy()
The slowest run took 5.44 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 390 µs per loop
The same discrepancy shows up with pandas Series. Interestingly, NumPy arrays don't behave this way, e.g.:
In [48]: import numpy as np
In [49]: myarray = np.random.rand(300000)
In [50]: %timeit myarray.copy()
10000 loops, best of 3: 162 µs per loop
In [52]: %timeit np.array(myarray)
10000 loops, best of 3: 168 µs per loop
Upvotes: 4
Views: 74
Reputation: 375675
It's because copy() actually creates a new internal representation of the DataFrame, whilst the constructor just points to the same one:
In [11]: df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])
In [12]: id(df._data) # internal attribute, don't futz with it!
Out[12]: 4472136472
In [13]: df1 = df.copy()
In [14]: id(df1._data) # different object
Out[14]: 4472572448
In [15]: df2 = pd.DataFrame(df)
In [16]: id(df2._data) # same as df._data
Out[16]: 4472136472
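The same check can be made without touching private attributes. Here is a minimal sketch, assuming NumPy's `np.shares_memory` and a hypothetical helper named `shares_buffer`. Whether `pd.DataFrame(df)` reports shared memory may depend on the pandas version, so only the `copy()` result is treated as guaranteed:

```python
import numpy as np
import pandas as pd


def shares_buffer(left, right, column):
    """Hypothetical helper: do both frames back `column` with the same memory?"""
    return bool(np.shares_memory(left[column].to_numpy(), right[column].to_numpy()))


df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])
df1 = df.copy()         # deep copy: allocates new buffers
df2 = pd.DataFrame(df)  # re-wrap: typically reuses the existing buffers

copy_shares = shares_buffer(df, df1, "A")
rewrap_shares = shares_buffer(df, df2, "A")

print("copy() shares memory with df:       ", copy_shares)
print("DataFrame(df) shares memory with df:", rewrap_shares)
```

Allocating and filling new buffers is the extra work that copy() is paying for in the timings.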
A corollary is that mutating the original DataFrame changes df2 but not df1 (the copy):
In [21]: df.iloc[0, 0] = 99
In [22]: df
Out[22]:
    A  B
0  99  2
1   3  4
In [23]: df1
Out[23]:
   A  B
0  1  2
1  3  4
In [24]: df2
Out[24]:
    A  B
0  99  2
1   3  4
This is the reason you want to use copy!
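If you prefer the constructor form, it also accepts a `copy` argument. A small sketch, assuming `copy=True` is passed explicitly (the variable names are only illustrative), showing that both forms stay unchanged when the original is mutated:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])
via_method = df.copy()                  # deep copy by default
via_ctor = pd.DataFrame(df, copy=True)  # constructor explicitly asked to copy

df.iloc[0, 0] = 99  # mutate the original only

original_value = int(df.iloc[0, 0])
method_value = int(via_method.iloc[0, 0])
ctor_value = int(via_ctor.iloc[0, 0])
print(original_value, method_value, ctor_value)  # 99 1 1
```

Both copies pay the allocation cost up front, so neither should be expected to match the speed of the bare constructor.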
In NumPy, both copy() and np.array() make a copy:
In [31]: a = np.array([1, 2])
In [32]: a1 = a.copy()
In [33]: a2 = np.array(a)
In [34]: a[0] = 99
In [35]: a1
Out[35]: array([1, 2])
In [36]: a2
Out[36]: array([1, 2])
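For comparison, the NumPy call that behaves roughly like `pd.DataFrame(df)` is `np.asarray`, which returns its input as-is when it is already an array and no dtype change is requested. A short sketch contrasting it with the two copying calls:

```python
import numpy as np

a = np.array([1, 2])
a1 = a.copy()       # always copies
a2 = np.array(a)    # copies by default
a3 = np.asarray(a)  # no copy: returns the same array object

a[0] = 99  # mutate the original only

print(a1.tolist(), a2.tolist(), a3.tolist())  # [1, 2] [1, 2] [99, 2]
print(a3 is a)                                # True
```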
Upvotes: 3