How is memory handled in Pandas when not capturing return value of transpose

Question

I'm dealing with a large dataframe (~100,000x1000) that I eventually write out using df.to_csv(). All the inputs I combine into this dataframe arrive transposed relative to the desired output, so the dataframe I build ends up transposed as well. At the very end I transpose it: df.T.to_csv(). I know df.T returns the transposed dataframe, which leads to my question: does not saving the result of df.T "help" my memory usage? Phrased differently, is df.T.to_csv() better than running dfT=df.T and dfT.to_csv() separately? Aside from memory, are there any advantages to one method over the other?

In summary, which method is better, and why?

method 1:

df.T.to_csv()

method 2:

dfT=df.T
dfT.to_csv()

G. Anderson · Accepted Answer

Overall, the two approaches are practically identical for this use case. Consider: in both cases the transpose has to be computed and held in memory before to_csv() can act on it. The only real difference is what happens after this line of code runs.

In the first case, df.T.to_csv() computes the transposed dataframe, writes it to file, and then, because nothing references the temporary object any longer, automated garbage collection is free to reclaim the memory allocated for it.

In the second case, because you've assigned it, the implicit instruction is to keep the allocated memory and the object stored in it for as long as the name dfT refers to it: until it is deleted, rebound, goes out of scope, or the script finishes running. The only real "advantage" I can think of to the second method is that you can reuse the transposed dataframe for other things if you need to.
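The lifetime difference described above can be sketched roughly as follows. This is an illustration only: the small 100x50 frame and the weakref probes are assumptions added here to make the object's lifetime observable, and they are not part of the original test.

```python
import gc
import weakref

import numpy as np
import pandas as pd

# Small illustrative frame (an assumption; far smaller than the real data).
df = pd.DataFrame(np.random.rand(100, 50))

# Method 1 style: the transposed frame is a temporary with no name.
# A weak reference lets us observe it without keeping it alive.
temp_probe = weakref.ref(df.T)
gc.collect()  # precaution only, in case any reference cycles exist
temp_alive = temp_probe() is not None

# Method 2 style: the name dfT keeps the transposed frame alive.
dfT = df.T
named_probe = weakref.ref(dfT)
alive_while_named = named_probe() is not None

# Dropping the name removes the last strong reference.
del dfT
gc.collect()  # precaution only, as above
alive_after_del = named_probe() is not None

print("temporary still alive:", temp_alive)
print("named frame alive while assigned:", alive_while_named)
print("named frame alive after del:", alive_after_del)
```

If memory matters after the write in method 2, an explicit del dfT (or letting the name go out of scope inside a function) puts the two methods on the same footing.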

This certainly holds true in my test case (using the %%memit magic from the memory_profiler extension in a Jupyter notebook):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10000, 10000))

%%memit
df.T.to_csv('test_transpose.csv')

peak memory: 929.00 MiB, increment: 34.18 MiB

%%memit
dfT=df.T
dfT.to_csv('test_transpose.csv')

peak memory: 929.84 MiB, increment: 33.66 MiB
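The %%memit cells above only run inside Jupyter. Outside a notebook, a similar comparison could be sketched with the standard library's tracemalloc. The helper name peak_bytes, the file names, and the much smaller 200x100 frame are assumptions for illustration. tracemalloc only counts allocations reported to Python's memory tracer, so its numbers are not directly comparable to the MiB figures above.

```python
import os
import tempfile
import tracemalloc

import numpy as np
import pandas as pd


def peak_bytes(func):
    """Run func() and return the peak traced allocation in bytes."""
    tracemalloc.start()
    try:
        func()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


# Much smaller than the 10000 x 10000 frame above, so the sketch runs quickly.
df = pd.DataFrame(np.random.rand(200, 100))

with tempfile.TemporaryDirectory() as tmp:
    path_1 = os.path.join(tmp, "method1.csv")
    path_2 = os.path.join(tmp, "method2.csv")

    def method_1():
        # Transposed frame is an unnamed temporary.
        df.T.to_csv(path_1)

    def method_2():
        # Transposed frame is bound to a local name, which disappears
        # when the function returns.
        dfT = df.T
        dfT.to_csv(path_2)

    peak_1 = peak_bytes(method_1)
    peak_2 = peak_bytes(method_2)

    # Both methods should write exactly the same file contents.
    with open(path_1) as f1, open(path_2) as f2:
        same_output = f1.read() == f2.read()

print(f"method 1 peak: {peak_1} bytes")
print(f"method 2 peak: {peak_2} bytes")
print("identical output:", same_output)
```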

And, using timing instead of memory profiling:

%%timeit
df.T.to_csv('test_transpose.csv')

2min 49s ± 6.3 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
dfT=df.T
dfT.to_csv('test_transpose.csv')

2min 51s ± 4.5 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
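The same timing comparison can be approximated in a plain script with the standard timeit module. This is a minimal sketch under assumptions: it uses a small 200x100 frame and writes to an in-memory io.StringIO buffer instead of a file, so the absolute times will be nothing like the minutes reported above.

```python
import io
import timeit

import numpy as np
import pandas as pd

# Small illustrative frame (an assumption) so the timing finishes quickly.
df = pd.DataFrame(np.random.rand(200, 100))


def method_1():
    buf = io.StringIO()
    df.T.to_csv(buf)  # transpose is an unnamed temporary
    return buf.getvalue()


def method_2():
    buf = io.StringIO()
    dfT = df.T  # transpose is bound to a local name first
    dfT.to_csv(buf)
    return buf.getvalue()


# Take the best of several repeats to reduce noise from other processes.
t1 = min(timeit.repeat(method_1, number=5, repeat=3))
t2 = min(timeit.repeat(method_2, number=5, repeat=3))

print(f"method 1: {t1:.4f} s for 5 calls")
print(f"method 2: {t2:.4f} s for 5 calls")
```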
