How to replace a df.loc multiple condition with a categorical type to safe memory in a pandas dataframe?

Question

I want to add a column to a very large pandas dataframe df1 (>10GB as .csv) that indicates if multiple conditions in other columns are met.

In the moment, I am doing

df.loc[(df.col1  == 1) & (df.col2 == 0) & (df.col3 == 1), "col4"] = "start"

and

df.loc[(df.col1  == 1) & (df2.col2 == 1) & (df.col3 == 0), "col4"] = "stop"

However, I am getting a MemoryError from the first df.loc line. This is most likely due to millions of "start" and "strop" strings in memory.

How to replace the df.loc lines with a boolean condition to avoid the MemoryError, while having the same visual results?

unutbu · Accepted Answer

Below I show how to create a categorical Series with small memory footprint. However, also keep in mind that it might be easier to process your DataFrame in chunks if your problem allows it.

The smallest values (memory-wise) in a NumPy array occupy 1 byte. These arrays have dtype np.int8 (for 8-bit ints), or np.bool or np.dtype('S1').

In [121]: np.dtype('int8').itemsize
Out[121]: 1

In [124]: np.dtype('int64').itemsize
Out[124]: 8

In [122]: np.dtype('bool').itemsize
Out[122]: 1

In [123]: np.dtype('S1').itemsize
Out[123]: 1

The underlying data in DataFrame columns are stored in NumPy arrays. So to make a DataFrame as small as possible, use a 1-byte dtype. This creates a DataFrame column of dtype int8 filled with zeros:

df['col4'] = np.zeros(len(df), dtype='int8')

Let 1 represent "start" and 2 represent "stop":

df.loc[(df.col1  == 1) & (df.col2 == 0) & (df.col3 == 1), "col4"] = 1
df.loc[(df.col1  == 1) & (df.col2 == 1) & (df.col3 == 0), "col4"] = 2

Note that besides the memory required for the col4 column, the above two lines requires additional space to compute 4 boolean Series: one for each of the 3 conditions, and a fourth to combine them. If those lines still raise MemoryErrors, you could try

mask = (df.col1  == 1)     # <-- requires space for 1 boolean Series, `mask`
mask &= (df.col2 == 0)     # <-- requires space for 2 boolean Series: mask and a tempory Series
mask &= (df.col3 == 1)     # <-- requires no additional space
df.loc[mask, 'col4'] = 1   # <-- requires no additional space

If conserving memory is of prime importance, you should stop here. However, if you want the 1's and 2's displayed as "start" and "stop", you could change the DataFrame column to category dtype:

df['col4'] = df['col4'].astype('category')

and then change the category labels:

df['col4'].cat.categories = ['', 'start', 'stop']

import numpy as np
import pandas as pd
np.random.seed(2017)
nrows, ncols = 20, 3
df = pd.DataFrame(np.random.randint(2, size=(nrows, ncols)), 
                  columns=['col1', 'col2', 'col3'])
df['col4'] = np.zeros(len(df), dtype='int8')
print(df['col4'].nbytes)
# df.loc[(df.col1  == 1) & (df.col2 == 0) & (df.col3 == 1), "col4"] = 1
# df.loc[(df.col1  == 1) & (df.col2 == 1) & (df.col3 == 0), "col4"] = 2
mask = (df.col1  == 1)
mask &= (df.col2 == 0)
mask &= (df.col3 == 1)
df.loc[mask, 'col4'] = 1
mask = (df.col1  == 1)
mask &= (df.col2 == 1)
mask &= (df.col3 == 0)
df.loc[mask, 'col4'] = 2
df['col4'] = df['col4'].astype('category')
print(df['col4'].nbytes)
df['col4'].cat.categories = ['', 'start', 'stop']
print(df['col4'].nbytes)
print(df)

yields

20   # the number of bytes required by `col4`
44   # a category column requires a bit more space
44   # the change of labels require a tiny bit more space, but not shown here

    col1  col2  col3   col4
0      1     1     0   stop
1      1     0     0       
2      0     0     1       
3      1     1     1       
4      0     0     0       
5      0     0     1       
6      1     0     0       
7      0     0     0       
8      1     0     1  start
9      1     1     0   stop
10     1     1     1       
11     1     0     1  start
12     0     0     0       
13     0     0     1       
14     0     0     0       
15     1     0     1  start
16     0     1     0       
17     0     1     1       
18     1     0     1  start
19     0     0     1

How to replace a df.loc multiple condition with a categorical type to safe memory in a pandas dataframe?

Answers (1)

Related Questions