BlueElement

Reputation: 41

PyCharm hitting MemoryError when doing conditional .loc

I'm running this on a large CSV file (6 million+ rows).

For some reason, the MemoryError seems to come about when applying a conditional .loc, based on one column, to a LARGE DataFrame.

Any help on how to overcome the MemoryError when using the conditional .loc would be greatly appreciated.

import numpy as np
import pandas as pd

columns = ['Date', 'Time', 'Open', 'High', 'Low', 'Close', 'Volume']

# Read the semicolon/comma separated file in 100k-row chunks, then stitch
# the chunks back into one DataFrame.
EUR = pd.concat(
    (chunk for chunk in pd.read_csv(file, header=None, sep=";|,", engine="python", chunksize=100000)),
    ignore_index=True)
EUR.columns = columns

# Combine Date and Time into a single datetime column.
EUR['Date'] = EUR['Date'].str.cat(EUR['Time'], sep=" ")
EUR['datetime'] = pd.to_datetime(EUR.Date, dayfirst=True)
del EUR['Time']
del EUR['Date']

rows = EUR.shape[0]
print('Rows: ' + str(rows))

EUR['Date-1'] = EUR['datetime'].shift(-1)  # next row's timestamp
EUR['Open-1'] = EUR['Open'].shift(-1)      # next row's open
EUR['TimeDiff'] = EUR['Date-1'] - EUR['datetime']
EUR['Minutesdiff'] = EUR['TimeDiff'] / np.timedelta64(1, 's') / 60

# Gap in percent between the next open and the current close, but only
# where the time gap exceeds 44 minutes (row-by-row .loc lookups).
EUR['GapPercent'] = [
    (EUR.loc[ei, 'Open-1'] / EUR.loc[ei, 'Close'] - 1) * 100
    if EUR.loc[ei, 'Minutesdiff'] > 44 else np.nan
    for ei in EUR.index]

pd.set_option('display.max_columns', None)
print(EUR['GapPercent'].max())
print(EUR['GapPercent'].min())

# These two boolean .loc filters are where the MemoryError is raised (see traceback below).
rslt_df_pos = EUR.loc[EUR['GapPercent'] >= 0.75]
rslt_df_neg = EUR.loc[EUR['GapPercent'] <= -0.75]
rslt_df_pos.to_csv(r'C:\gap_exports\EURposgap.csv', index=True, header=True)
rslt_df_neg.to_csv(r'C:\gap_exports\EURneggap.csv', index=True, header=True)

Here's the output, followed by the traceback:

Rows: 6525490

1.6992939815894115

-1.2157646236684472

Traceback (most recent call last):
  File "C:/Gap_testing/$$$ backtestb.py", line 290, in <module>
    load_csv(sheet)
  File "C:/Gap_testing/$$$ backtestb.py", line 51, in load_csv
    rslt_df_pos = EUR.loc[EUR['GapPercent'] >= 0.75]
  File "C:\Gap_testing\venv\lib\site-packages\pandas\core\indexing.py", line 1768, in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
  File "C:\Gap_testing\venv\lib\site-packages\pandas\core\indexing.py", line 1914, in _getitem_axis
    return self._getbool_axis(key, axis=axis)
  File "C:\Gap_testing\venv\lib\site-packages\pandas\core\indexing.py", line 1784, in _getbool_axis
    return self.obj._take_with_is_copy(inds, axis=axis)
  File "C:\Gap_testing\venv\lib\site-packages\pandas\core\generic.py", line 3409, in _take_with_is_copy
    result = self.take(indices=indices, axis=axis, **kwargs)
  File "C:\Gap_testing\venv\lib\site-packages\pandas\core\generic.py", line 3392, in take
    self._consolidate_inplace()
  File "C:\Gap_testing\venv\lib\site-packages\pandas\core\generic.py", line 5347, in _consolidate_inplace
    self._protect_consolidate(f)
  File "C:\Gap_testing\venv\lib\site-packages\pandas\core\generic.py", line 5336, in _protect_consolidate
    result = f()
  File "C:\Gap_testing\venv\lib\site-packages\pandas\core\generic.py", line 5345, in f
    self._data = self._data.consolidate()
  File "C:\Gap_testing\venv\lib\site-packages\pandas\core\internals\managers.py", line 940, in consolidate
    bm._consolidate_inplace()
  File "C:\Gap_testing\venv\lib\site-packages\pandas\core\internals\managers.py", line 945, in _consolidate_inplace
    self.blocks = tuple(_consolidate(self.blocks))
  File "C:\Gap_testing\venv\lib\site-packages\pandas\core\internals\managers.py", line 1887, in _consolidate
    list(group_blocks), dtype=dtype, _can_consolidate=_can_consolidate
  File "C:\Gap_testing\venv\lib\site-packages\pandas\core\internals\blocks.py", line 3099, in _merge_blocks
    new_values = np.vstack([b.values for b in blocks])
  File "<__array_function__ internals>", line 6, in vstack
  File "C:\Gap_testing\venv\lib\site-packages\numpy\core\shape_base.py", line 283, in vstack
    return _nx.concatenate(arrs, 0)
  File "<__array_function__ internals>", line 6, in concatenate
MemoryError: Unable to allocate 348. MiB for an array with shape (7, 6525490) and data type float64

Upvotes: 0

Views: 166

Answers (1)

BlueElement

Reputation: 41

I'm sure there is a more elegant way to handle this, but my solution was a while loop that breaks the .loc filtering into 400k-row slices.

column_cycle_open = 0
column_cycle = 400000
rslt_df = pd.DataFrame()

# Filter 400k rows at a time so pandas never has to take/consolidate the
# whole 6.5M-row frame in one go. Loop on the slice start so the final
# partial slice is processed too (slicing past the end is harmless).
while column_cycle_open < rows:
    chunk = EUR[column_cycle_open:column_cycle]
    rslt_df = rslt_df.append(chunk.loc[chunk['GapPercent'] >= 0.75], ignore_index=True)
    rslt_df = rslt_df.append(chunk.loc[chunk['GapPercent'] <= -0.75], ignore_index=True)
    column_cycle_open += 400000
    column_cycle += 400000

rslt_df.to_csv(r'C:\gap_exports\EURgap.csv', index=True, header=True)

Happy to provide answer credit to a more elegant solution
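
For reference, a sketch of one possibly lighter-weight route (untested; assumes the same EUR frame and column names as in the question, nothing beyond pandas and NumPy): downcast the float columns to float32, which roughly halves the block pandas has to allocate when it consolidates, and build GapPercent vectorised with np.where instead of the row-by-row .loc loop.

# Untested sketch -- same EUR frame and column names as above.
float_cols = ['Open', 'High', 'Low', 'Close', 'Volume', 'Open-1']
EUR[float_cols] = EUR[float_cols].astype('float32')  # halves the float footprint

# Vectorised gap calculation: no per-row .loc lookups.
EUR['GapPercent'] = np.where(
    EUR['Minutesdiff'] > 44,
    (EUR['Open-1'] / EUR['Close'] - 1) * 100,
    np.nan).astype('float32')

# Single combined filter instead of two separate boolean .loc passes.
rslt_df = EUR[EUR['GapPercent'].abs() >= 0.75]
rslt_df.to_csv(r'C:\gap_exports\EURgap.csv', index=True, header=True)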

Upvotes: 1
