Reputation: 4018
I want to add a column to a very large pandas dataframe df1
(>10GB as .csv) that indicates if multiple conditions in other columns are met.
In the moment, I am doing
df.loc[(df.col1 == 1) & (df.col2 == 0) & (df.col3 == 1), "col4"] = "start"
and
df.loc[(df.col1 == 1) & (df2.col2 == 1) & (df.col3 == 0), "col4"] = "stop"
However, I am getting a MemoryError
from the first df.loc
line. This is most likely due to millions of "start" and "strop" strings in memory.
How to replace the df.loc
lines with a boolean condition to avoid the MemoryError
, while having the same visual results?
Upvotes: 0
Views: 3012
Reputation: 879739
Below I show how to create a categorical Series with small memory footprint. However, also keep in mind that it might be easier to process your DataFrame in chunks if your problem allows it.
The smallest values (memory-wise) in a NumPy array occupy 1 byte. These arrays have dtype np.int8
(for 8-bit ints), or np.bool
or np.dtype('S1')
.
In [121]: np.dtype('int8').itemsize
Out[121]: 1
In [124]: np.dtype('int64').itemsize
Out[124]: 8
In [122]: np.dtype('bool').itemsize
Out[122]: 1
In [123]: np.dtype('S1').itemsize
Out[123]: 1
The underlying data in DataFrame columns are stored in NumPy arrays.
So to make a DataFrame as small as possible, use a 1-byte dtype.
This creates a DataFrame column of dtype int8
filled with zeros:
df['col4'] = np.zeros(len(df), dtype='int8')
Let 1
represent "start"
and 2
represent "stop"
:
df.loc[(df.col1 == 1) & (df.col2 == 0) & (df.col3 == 1), "col4"] = 1
df.loc[(df.col1 == 1) & (df.col2 == 1) & (df.col3 == 0), "col4"] = 2
Note that besides the memory required for the col4
column, the above two lines requires additional space to compute 4 boolean Series: one for each of the 3 conditions, and a fourth to combine them. If those lines still raise MemoryErrors, you could try
mask = (df.col1 == 1) # <-- requires space for 1 boolean Series, `mask`
mask &= (df.col2 == 0) # <-- requires space for 2 boolean Series: mask and a tempory Series
mask &= (df.col3 == 1) # <-- requires no additional space
df.loc[mask, 'col4'] = 1 # <-- requires no additional space
If conserving memory is of prime importance, you should stop here.
However, if you want the 1's and 2's displayed as "start"
and "stop"
, you could change the DataFrame column to category
dtype:
df['col4'] = df['col4'].astype('category')
and then change the category labels:
df['col4'].cat.categories = ['', 'start', 'stop']
import numpy as np
import pandas as pd
np.random.seed(2017)
nrows, ncols = 20, 3
df = pd.DataFrame(np.random.randint(2, size=(nrows, ncols)),
columns=['col1', 'col2', 'col3'])
df['col4'] = np.zeros(len(df), dtype='int8')
print(df['col4'].nbytes)
# df.loc[(df.col1 == 1) & (df.col2 == 0) & (df.col3 == 1), "col4"] = 1
# df.loc[(df.col1 == 1) & (df.col2 == 1) & (df.col3 == 0), "col4"] = 2
mask = (df.col1 == 1)
mask &= (df.col2 == 0)
mask &= (df.col3 == 1)
df.loc[mask, 'col4'] = 1
mask = (df.col1 == 1)
mask &= (df.col2 == 1)
mask &= (df.col3 == 0)
df.loc[mask, 'col4'] = 2
df['col4'] = df['col4'].astype('category')
print(df['col4'].nbytes)
df['col4'].cat.categories = ['', 'start', 'stop']
print(df['col4'].nbytes)
print(df)
yields
20 # the number of bytes required by `col4`
44 # a category column requires a bit more space
44 # the change of labels require a tiny bit more space, but not shown here
col1 col2 col3 col4
0 1 1 0 stop
1 1 0 0
2 0 0 1
3 1 1 1
4 0 0 0
5 0 0 1
6 1 0 0
7 0 0 0
8 1 0 1 start
9 1 1 0 stop
10 1 1 1
11 1 0 1 start
12 0 0 0
13 0 0 1
14 0 0 0
15 1 0 1 start
16 0 1 0
17 0 1 1
18 1 0 1 start
19 0 0 1
Upvotes: 1