Lorenzo

Reputation: 81

Pandas shallow copy two columns of a DataFrame

I'm having some difficulty creating a shallow copy of two columns in a Pandas DataFrame.

I have the following code:

import numpy as np
import pandas as pd

data = pd.DataFrame(np.zeros((5,3)), columns=["a","b","c"])
print(data)
b = data.loc[:, "a"]
b += 1
print(data)

In this way I have a reference to the first column of the DataFrame, and b += 1 effectively adds 1 to the first column of data (printing data, I can see that the values changed). I would like to do something similar with two columns. I've noticed, though, that if I define b as b = data.loc[:, ["a", "b"]], I get the two columns I want, but the variable b is then independent of data (changing b does not change the values in data). I hope someone can help me figure this out.

Thanks,

Lorenzo

EDIT

As pointed out in the comments, I should have included the expected output. With the block of code included above, the output is equal to the expected one and is:

     a    b    c
0  0.0  0.0  0.0
1  0.0  0.0  0.0
2  0.0  0.0  0.0
3  0.0  0.0  0.0
4  0.0  0.0  0.0

     a    b    c
0  1.0  0.0  0.0
1  1.0  0.0  0.0
2  1.0  0.0  0.0
3  1.0  0.0  0.0
4  1.0  0.0  0.0

If I now run this code:

import numpy as np
import pandas as pd

data = pd.DataFrame(np.zeros((5,3)), columns=["a","b","c"])
print(data)
b = data.loc[:, ["a", "b"]]
b += 1
print(data)

I expect the second print statement to output

     a    b    c
0  1.0  1.0  0.0
1  1.0  1.0  0.0
2  1.0  1.0  0.0
3  1.0  1.0  0.0
4  1.0  1.0  0.0

I actually obtain

     a    b    c
0  0.0  0.0  0.0
1  0.0  0.0  0.0
2  0.0  0.0  0.0
3  0.0  0.0  0.0
4  0.0  0.0  0.0

Upvotes: 2

Views: 300

Answers (1)

Cameron Riddell

Reputation: 13407

Unfortunately, pandas is not a particularly memory-efficient library: most of its operations return a copy that is independent of its source rather than a view. A few small operations, such as the single-column .loc selection you pointed out, can in fact return views, but the vast majority of other pandas methods copy the underlying data before doing anything and return that copy. While this is an annoyance to some, it saves most users from bugs where values in the source dataset change while they are manipulating a subset.
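
One way to check the view-versus-copy behavior described above is to ask NumPy whether two objects share memory. The following is a minimal sketch that reuses the DataFrame from the question; the names two_cols, one_col and two_cols_shares are only illustrative. The single-column result depends on the pandas version and on whether Copy-on-Write is enabled, so no output is claimed for it here.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.zeros((5, 3)), columns=["a", "b", "c"])

# Selecting with a list of labels builds a new object with its own data.
two_cols = data.loc[:, ["a", "b"]]
two_cols_shares = np.shares_memory(two_cols.to_numpy(), data.to_numpy())
print(two_cols_shares)

# Selecting a single label may return a view of the original data;
# whether it does depends on the pandas version and its settings.
one_col = data.loc[:, "a"]
print(np.shares_memory(one_col.to_numpy(), data.to_numpy()))

# Modifying the two-column selection leaves data untouched.
two_cols += 1
print(data)
```

Because two_cols holds its own data, the final print still shows zeros in every column of data, which is the behavior reported in the question.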

If you want a zero-copy dataframe framework, I would recommend vaex. While I haven't used it a ton personally, I have been keeping my eye on this project as it has matured. It is syntactically similar to pandas but is more memory-efficient and even works seamlessly with datasets larger than your RAM capacity.

As a related piece, I would also recommend checking out this article https://realpython.com/pandas-settingwithcopywarning/ for a little bit more on how pandas and numpy handle views vs copies.
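
If the goal is simply to update both columns of data, one approach that does not rely on views is to write the result back through .loc in a single statement. This is a minimal sketch of that idea rather than a shallow copy; the name cols is only illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.zeros((5, 3)), columns=["a", "b", "c"])
cols = ["a", "b"]  # illustrative name for the columns to update

# Augmented assignment on the indexer: pandas reads the two columns,
# adds 1, and then assigns the result back into data.
data.loc[:, cols] += 1
print(data)
```

After this runs, columns a and b contain 1.0 and column c is left at 0.0, which matches the expected output in the question.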

Upvotes: 3
