How are pandas data frames stored in memory?

Question

In particular, when I create a DataFrame by concatenating two pandas Series objects, does Python allocate new memory and store copies of the series, or does it just create references to the two series?

If it just makes references, then would modifying the series like series.name = "new_name" affect the column names of the DataFrame?

Also, would getting a series from a DataFrame like series = df['column_name'] take O(1) time or O(n) time?

David Hope · Accepted Answer

A quick test shows that the cost is in the concat, not in the column lookup. So, BLUF: df['s1'] is O(1) while concat is O(n).

Going from a single item per series to 40 million per series, the lookup takes about the same amount of time, while the concat time appears to grow linearly.
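The copy-versus-reference part of the question can also be checked directly. What follows is only a minimal sketch, not part of the timing test: the three-element series are made up for illustration, and whether the column shares a buffer with the original series may depend on the pandas version and its copy-on-write setting, so that result is printed rather than assumed.

```python
import numpy as np
import pandas as pd

# Toy data, chosen only for illustration.
s1 = pd.Series([1, 2, 3], name="s1")
s2 = pd.Series([-1, -2, -3], name="s2")
df = pd.concat([s1, s2], axis=1)

# The column labels were fixed when the DataFrame was built, so renaming
# the original Series afterwards leaves them alone.
s1.name = "new_name"
columns_after_rename = list(df.columns)
print(columns_after_rename)

# Check whether the DataFrame column and the original Series point at the
# same underlying buffer. The answer may differ between pandas versions.
shares = np.shares_memory(s1.to_numpy(), df["s1"].to_numpy())
print("shares memory with original:", shares)
```

The column names stay 's1' and 's2' after the rename, which answers the series.name part of the question regardless of how the data itself is stored.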

The timings come from this simple code:

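import time

import numpy as np
import pandas as pd


def func(frange):
    # Build two equal-length Series that share the same index.
    values = np.arange(frange)
    s1 = pd.Series(values, index=values, name='s1')
    s2 = pd.Series(-values, index=values, name='s2')

    # Time the concat that builds the DataFrame.
    cstart = time.perf_counter()
    df = pd.concat([s1, s2], axis=1)
    cend = time.perf_counter()

    # Time 100 rounds of looking up both columns.
    tstart = time.perf_counter()
    for y in range(100):
        series = df['s1']
        series2 = df['s2']
    tend = time.perf_counter()

    # Output: size, concat seconds, lookup seconds.
    print(frange, ',', cend - cstart, tend - tstart)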

the results are:
