Anonymous

Reputation: 305

How are pandas data frames stored in memory?

In particular, when I create a DataFrame by concatenating two pandas Series objects, does Python allocate new memory and store copies of the series, or does it just create references to the two series?

If it just makes references, would modifying a series, e.g. series.name = "new_name", affect the column names of the DataFrame?

Also, would getting a series from a DataFrame like series = df['column_name'] take O(1) time or O(n) time?

Upvotes: 3

Views: 2938

Answers (1)

David Hope

Reputation: 2256

A quick test shows that the cost lies in the concat, not in the column lookup (the dereference). Bottom line up front: df['s1'] is O(1), while concat is O(n).

With series sizes ranging from 1 item to 40 million items, the dereference takes roughly the same amount of time throughout, while the concat time appears to increase linearly.

Using this simple code:

import time

import numpy
import pandas as pd


def func(frange):
    # Build two plain Python lists of frange values each.
    a1 = []
    a2 = []
    for x in numpy.arange(frange):
        a1.append(x)
        a2.append(-x)

    s1 = pd.Series(a1, index=a1, name='s1')
    s2 = pd.Series(a2, index=a1, name='s2')

    # Time the concatenation of the two series into one DataFrame.
    cstart = time.perf_counter()
    df = pd.concat([s1, s2], axis=1)
    cend = time.perf_counter()

    # Time 100 rounds of column lookups.
    tstart = time.perf_counter()
    for y in range(100):
        series = df['s1']
        series2 = df['s2']
    tend = time.perf_counter()

    print(frange, ',', cend - cstart, tend - tstart)

The results are:

[image: timing results]
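The question also asks whether renaming a source series changes the DataFrame's column names. A minimal sketch for checking this yourself, assuming two small example series (the variable names columns_after_rename and shares are illustrative only and not part of the test above):

```python
import numpy as np
import pandas as pd

s1 = pd.Series(np.arange(5), name="s1")
s2 = pd.Series(-np.arange(5), name="s2")
df = pd.concat([s1, s2], axis=1)

# Rename the source series after the DataFrame has been built.
# The column labels live in df.columns, so they should stay unchanged.
s1.name = "new_name"
columns_after_rename = list(df.columns)
print(columns_after_rename)

# Whether the column data shares a buffer with the source series can
# depend on the pandas version and its copy-on-write settings, so this
# only reports the result instead of assuming one.
shares = np.shares_memory(s1.to_numpy(), df["s1"].to_numpy())
print("shares memory with s1:", shares)
```

The column labels are taken from the series names at concat time and stored on the DataFrame, so the later rename is not reflected in df.columns. The memory-sharing result is best read as an observation about your installed version rather than a guarantee.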

Upvotes: 4
