Reputation: 81
I'm having difficulty creating a shallow copy of two columns in a Pandas DataFrame.
I have the following code:
import numpy as np
import pandas as pd
data = pd.DataFrame(np.zeros((5,3)), columns=["a","b","c"])
print(data)
b = data.loc[:, "a"]
b += 1
print(data)
This way b is a reference to the first column of the DataFrame, and b += 1 effectively adds 1 to the first column of data (printing data shows that the values changed).
I would like to do something similar, but with two columns. I've noticed, though, that if I define b as b = data.loc[:, ["a", "b"]], I get the two columns I want, but the variable b is then independent of data (changing b does not change the values in data).
I hope someone can help me figure this out.
Thanks,
Lorenzo
EDIT
As pointed out in the comments, I should have included the expected output. With the first block of code above, the actual output matches the expected one:
a b c
0 0.0 0.0 0.0
1 0.0 0.0 0.0
2 0.0 0.0 0.0
3 0.0 0.0 0.0
4 0.0 0.0 0.0
a b c
0 1.0 0.0 0.0
1 1.0 0.0 0.0
2 1.0 0.0 0.0
3 1.0 0.0 0.0
4 1.0 0.0 0.0
If I now run the code
import numpy as np
import pandas as pd
data = pd.DataFrame(np.zeros((5,3)), columns=["a","b","c"])
print(data)
b = data.loc[:, ["a", "b"]]
b += 1
print(data)
I would expect the second print statement to output
a b c
0 1.0 1.0 0.0
1 1.0 1.0 0.0
2 1.0 1.0 0.0
3 1.0 1.0 0.0
4 1.0 1.0 0.0
I actually obtain
a b c
0 0.0 0.0 0.0
1 0.0 0.0 0.0
2 0.0 0.0 0.0
3 0.0 0.0 0.0
4 0.0 0.0 0.0
Upvotes: 2
Views: 300
Reputation: 13407
Unfortunately, pandas is not a particularly memory-efficient library: most of its operations return not a view but a copy that is independent of its source. Small operations, like the single-column .loc you pointed out, can in fact return views, but the vast majority of other pandas methods copy the underlying data before doing anything and return that copy. While this may seem like an annoyance, it saves most users from hard-to-trace bugs where values in the source dataset change while they are manipulating a subset.
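If the goal is simply to update both columns of data, one option is to write the result back through an indexer on the original frame instead of relying on a view. The following is a minimal sketch using the data frame from the question; the names cols, subset and unchanged_after_subset_edit are only illustrative. Note also that when Copy-on-Write is enabled in newer pandas versions, even the single-column case should no longer propagate changes back to data, so explicit assignment is the more portable pattern.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.zeros((5, 3)), columns=["a", "b", "c"])
cols = ["a", "b"]  # illustrative name for the columns to update

# Selecting with a list of labels yields an independent object,
# so modifying `subset` leaves `data` untouched.
subset = data.loc[:, cols]
subset += 1
unchanged_after_subset_edit = bool((data == 0).all().all())

# Assigning through .loc on the original frame does update it.
data.loc[:, cols] = data.loc[:, cols] + 1

print(unchanged_after_subset_edit)
print(data)
```

The assignment costs a temporary copy of the two columns, but it does not depend on whether a particular selection happens to be a view.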
If you want a zero-copy dataframe framework, I would recommend vaex. While I haven't used it much personally, I have been keeping an eye on the project as it has matured. It is syntactically similar to pandas but is more memory-efficient and even works seamlessly with datasets larger than your RAM.
As a related read, I would also recommend this article, https://realpython.com/pandas-settingwithcopywarning/, for a bit more on how pandas and numpy handle views vs. copies.
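The view/copy distinction is easiest to see at the NumPy level. As a small illustration (plain NumPy only, not a description of pandas internals), basic slicing returns a view that shares memory with the original array, while indexing with a list of positions returns a copy:

```python
import numpy as np

arr = np.zeros((5, 3))

# Basic slicing returns a view: in-place changes reach `arr`.
view = arr[:, 0:2]
view += 1

# Indexing with a list (fancy indexing) returns a copy:
# in-place changes stay in `copy` only.
copy = arr[:, [0, 1]]
copy += 10

print(np.shares_memory(arr, view))  # True
print(np.shares_memory(arr, copy))  # False
print(arr[0])                       # [1. 1. 0.]
```

Selecting two columns by a list of labels is much closer to the second case than to the first, which is consistent with the behaviour seen in the question.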
Upvotes: 3