clocker

Reputation: 1366

Reconstructing a pandas object versus copy()

Does anyone know why a pandas object's copy() method seems much slower than reconstructing the object with its constructor? Is there any reason to use copy() over the standard constructor?

Here is a quick result:

In [42]: import numpy as np
    ...: import pandas as pd

In [43]: df = pd.DataFrame(np.random.rand(300000).reshape(100000,3), columns=list('ABC'))

In [44]: %timeit pd.DataFrame(df)
The slowest run took 5.61 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 3.95 µs per loop

In [45]: %timeit df.copy() 
The slowest run took 5.44 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 390 µs per loop

The same discrepancy shows up with pandas Series. Interestingly, numpy arrays don't behave this way, e.g.:

In [48]: import numpy as np

In [49]: myarray = np.random.rand(300000)

In [50]: %timeit myarray.copy()
10000 loops, best of 3: 162 µs per loop

In [52]: %timeit np.array(myarray)
10000 loops, best of 3: 168 µs per loop

Upvotes: 4

Views: 74

Answers (1)

Andy Hayden

Reputation: 375675

It's because copy() actually creates a new internal representation of the DataFrame, whilst the constructor just points to the same one:

In [11]: df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])

In [12]: id(df._data)  # internal attribute, don't futz with it!
Out[12]: 4472136472

In [13]: df1 = df.copy()

In [14]: id(df1._data)  # different object
Out[14]: 4472572448

In [15]: df2 = pd.DataFrame(df)

In [16]: id(df2._data)  # same as df._data
Out[16]: 4472136472

A corollary is that mutating the original DataFrame also changes df2, but not df1 (the copy):

In [21]: df.iloc[0, 0] = 99

In [22]: df
Out[22]:
    A  B
0  99  2
1   3  4

In [23]: df1
Out[23]:
   A  B
0  1  2
1  3  4

In [24]: df2
Out[24]:
    A  B
0  99  2
1   3  4

This is the reason you want to use copy!
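If you'd rather not poke at internal attributes, here is a minimal sketch that checks for shared data with np.shares_memory instead. The helper name shares_column_memory is made up for illustration, and it assumes a plain float column so that to_numpy() hands back the underlying buffer rather than a converted copy. What the constructor case prints may depend on your pandas version, so treat it as something to check rather than a guarantee:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, 3.0], "B": [2.0, 4.0]})

df_copy = df.copy()            # deep copy by default: new underlying data
df_wrapped = pd.DataFrame(df)  # constructor called on an existing DataFrame


def shares_column_memory(left, right, column):
    """Hypothetical helper: True if `column` is backed by overlapping memory in both frames."""
    return np.shares_memory(left[column].to_numpy(), right[column].to_numpy())


copy_shares = shares_column_memory(df, df_copy, "A")
wrapped_shares = shares_column_memory(df, df_wrapped, "A")

print("copy() shares memory with original:     ", copy_shares)
print("constructor shares memory with original:", wrapped_shares)
```

The copy() case should report no shared memory, which is exactly the extra work that shows up in the timings.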


In numpy, both a.copy() and np.array(a) make a copy (np.array copies by default), which is why their timings match:

In [31]: a = np.array([1, 2])

In [32]: a1 = a.copy()

In [33]: a2 = np.array(a)

In [34]: a[0] = 99

In [35]: a1
Out[35]: array([1, 2])

In [36]: a2
Out[36]: array([1, 2])
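For comparison, the closest numpy analogue to the no-copy behaviour is np.asarray, which returns the input itself when it is already an ndarray and no dtype conversion is requested. A small sketch (the variable names are just for illustration):

```python
import numpy as np

a = np.array([1, 2])

a_copy = a.copy()       # always a new array
a_array = np.array(a)   # copies by default
a_same = np.asarray(a)  # no copy: returns the very same ndarray object

a[0] = 99

print(a_copy.tolist())   # unaffected by the mutation
print(a_array.tolist())  # unaffected by the mutation
print(a_same.tolist())   # reflects the mutation
print(a_same is a)       # same object as the original
```

So the two numpy calls in the question cost about the same because both copy the data, whereas np.asarray would be the cheap one.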

Upvotes: 3
