Reputation: 1334
I am working on an ETL pipeline with pandas and I am exceeding my computer's memory.
I have been reading about memory usage in Python and I don't understand how memory works when I create a pandas DataFrame, assign it a name, and then reuse that same name while applying transformations or adding more columns to it.
For example:
import pandas as pd

df = pd.DataFrame(
    {'column1': [1, 2],
     'column2': ['a', 'b']})
If I now want to add another column to this DataFrame:
df['column3'] = 1
Is the memory used by the first df DataFrame replaced by this new df DataFrame, or is Python now using memory for both DataFrames?
What happens if I then want to remove one of the columns?
df = df.drop(columns=['column1'])
Upvotes: 0
Views: 362
Reputation: 901
The pandas documentation says:
All pandas data structures are value-mutable (the values they contain can be altered) but not always size-mutable. The length of a Series cannot be changed, but, for example, columns can be inserted into a DataFrame.
and also:
Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
Also, if you inspect all your in-scope variables with the dir()
function, you can see that there is only one DataFrame object defined.
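Here is a minimal sketch of that check (the variable names are just for illustration): adding a column mutates the existing DataFrame, so the name df still points at the same object and no second DataFrame appears in memory.

import pandas as pd

df = pd.DataFrame({'column1': [1, 2], 'column2': ['a', 'b']})

before = id(df)          # identity of the underlying object
df['column3'] = 1        # column insertion happens in place

print('df' in dir())     # True -- only one name is bound
print(id(df) == before)  # True -- same object, no second DataFrame was created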
To conclude, it seems to me that Python does not keep a second copy of your DataFrame; only one copy is stored in memory when you add/remove a column. Furthermore, if you want a copy of a DataFrame that actually duplicates all the values into another variable, you should use the .copy()
method.
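As a rough illustration (again, names are only examples), .copy() allocates a separate DataFrame, while drop() without inplace=True returns a new object; rebinding the name df leaves the old DataFrame unreferenced, so in practice only one copy of the data remains reachable. You can inspect the footprint with memory_usage().

import pandas as pd

df = pd.DataFrame({'column1': [1, 2], 'column2': ['a', 'b']})

# .copy() allocates a separate DataFrame: mutating one does not affect the other
df2 = df.copy()
df2['column3'] = 1
print(df.columns.tolist())         # ['column1', 'column2'] -- original untouched

# drop() returns a new DataFrame; rebinding df frees the old one for garbage collection
df = df.drop(columns=['column1'])
print(df.memory_usage(deep=True))  # per-column memory report in bytes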
Upvotes: 1