Reputation: 41
I'm running this on a large CSV file (6 million+ rows).
For some reason the MemoryError seems to come about when filtering a LARGE dataframe with a conditional .loc.
Any help on how to overcome the MemoryError when using the conditional .loc would be greatly appreciated.
columns = ['Date', 'Time', 'Open', 'High', 'Low', 'Close', 'Volume']
EUR = pd.concat(
    (chunk for chunk in pd.read_csv(file, header=None, sep=";|,", engine="python", chunksize=100000)),
    ignore_index=True)
EUR.columns = columns
EUR['Date'] = EUR['Date'].str.cat(EUR['Time'], sep=" ")
EUR['datetime'] = pd.to_datetime(EUR.Date, dayfirst=True)
del EUR['Time']
del EUR['Date']
rows = EUR.shape[0]
print('Rows: ' + str(rows))
EUR['Date-1'] = EUR['datetime'].shift(-1) # creating a new column
EUR['Open-1'] = EUR['Open'].shift(-1) # creating a new column
EUR['TimeDiff'] = EUR['Date-1'] - EUR['datetime']
EUR['Minutesdiff'] = EUR['TimeDiff'] / np.timedelta64(1, 's') / 60
EUR['GapPercent'] = [
    (EUR.loc[ei, 'Open-1'] / EUR.loc[ei, 'Close'] - 1) * 100 if EUR.loc[ei, 'Minutesdiff'] > 44 else np.NaN
    for ei in EUR.index]
pd.set_option('display.max_columns', None)
print(EUR['GapPercent'].max())
print(EUR['GapPercent'].min())
rslt_df_pos = EUR.loc[EUR['GapPercent'] >= 0.75]
rslt_df_neg = EUR.loc[EUR['GapPercent'] <= -0.75]
rslt_df_pos.to_csv(r'C:\gap_exports\EURposgap.csv', index=True, header=True)
rslt_df_neg.to_csv(r'C:\gap_exports\EURneggap.csv', index=True, header=True)
Here's the output and the error:
Rows: 6525490
1.6992939815894115
-1.2157646236684472
Traceback (most recent call last):
  File "C:/Gap_testing/$$$ backtestb.py", line 290, in <module>
    load_csv(sheet)
  File "C:/Gap_testing/$$$ backtestb.py", line 51, in load_csv
    rslt_df_pos = EUR.loc[EUR['GapPercent'] >= 0.75]
  File "C:\Gap_testing\venv\lib\site-packages\pandas\core\indexing.py", line 1768, in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
  File "C:\Gap_testing\venv\lib\site-packages\pandas\core\indexing.py", line 1914, in _getitem_axis
    return self._getbool_axis(key, axis=axis)
  File "C:\Gap_testing\venv\lib\site-packages\pandas\core\indexing.py", line 1784, in _getbool_axis
    return self.obj._take_with_is_copy(inds, axis=axis)
  File "C:\Gap_testing\venv\lib\site-packages\pandas\core\generic.py", line 3409, in _take_with_is_copy
    result = self.take(indices=indices, axis=axis, **kwargs)
  File "C:\Gap_testing\venv\lib\site-packages\pandas\core\generic.py", line 3392, in take
    self._consolidate_inplace()
  File "C:\Gap_testing\venv\lib\site-packages\pandas\core\generic.py", line 5347, in _consolidate_inplace
    self._protect_consolidate(f)
  File "C:\Gap_testing\venv\lib\site-packages\pandas\core\generic.py", line 5336, in _protect_consolidate
    result = f()
  File "C:\Gap_testing\venv\lib\site-packages\pandas\core\generic.py", line 5345, in f
    self._data = self._data.consolidate()
  File "C:\Gap_testing\venv\lib\site-packages\pandas\core\internals\managers.py", line 940, in consolidate
    bm._consolidate_inplace()
  File "C:\Gap_testing\venv\lib\site-packages\pandas\core\internals\managers.py", line 945, in _consolidate_inplace
    self.blocks = tuple(_consolidate(self.blocks))
  File "C:\Gap_testing\venv\lib\site-packages\pandas\core\internals\managers.py", line 1887, in _consolidate
    list(group_blocks), dtype=dtype, _can_consolidate=_can_consolidate
  File "C:\Gap_testing\venv\lib\site-packages\pandas\core\internals\blocks.py", line 3099, in _merge_blocks
    new_values = np.vstack([b.values for b in blocks])
  File "<__array_function__ internals>", line 6, in vstack
  File "C:\Gap_testing\venv\lib\site-packages\numpy\core\shape_base.py", line 283, in vstack
    return _nx.concatenate(arrs, 0)
  File "<__array_function__ internals>", line 6, in concatenate
MemoryError: Unable to allocate 348. MiB for an array with shape (7, 6525490) and data type float64
Upvotes: 0
Views: 166
Reputation: 41
I'm sure there is a more elegant way to handle this, but my solution was a while loop that breaks the .loc up into 400k-row slices.
column_cycle_open = 0
column_cycle = 400000
rslt_df = pd.DataFrame()
while column_cycle_open < rows:  # compare against the slice start so the final partial slice is included
    rslt_df = rslt_df.append(EUR[column_cycle_open:column_cycle].loc[EUR['GapPercent'] >= 0.75], ignore_index=True)
    rslt_df = rslt_df.append(EUR[column_cycle_open:column_cycle].loc[EUR['GapPercent'] <= -0.75], ignore_index=True)
    column_cycle_open += 400000
    column_cycle += 400000
rslt_df.to_csv(r'C:\gap_exports\EURgap.csv', index=True, header=True)
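The repeated .append copies rslt_df on every pass; collecting the filtered slices in a list and concatenating once at the end does the same job with less copying. A minimal sketch of that variant, assuming the same EUR frame, rows count, and thresholds as above:
import pandas as pd

chunk_size = 400000
parts = []
for start in range(0, rows, chunk_size):
    # filter one 400k slice at a time so pandas never has to
    # consolidate the whole 6.5M-row frame in a single allocation
    chunk = EUR.iloc[start:start + chunk_size]
    mask = (chunk['GapPercent'] >= 0.75) | (chunk['GapPercent'] <= -0.75)
    parts.append(chunk[mask])

rslt_df = pd.concat(parts, ignore_index=True)
rslt_df.to_csv(r'C:\gap_exports\EURgap.csv', index=True, header=True)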
Happy to provide answer credit to a more elegant solution.
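One further idea that might shrink the allocations enough to avoid the chunking entirely (a sketch using the column names from the question, not tested at this scale): replace the GapPercent list comprehension with a vectorized np.where, which skips the millions of scalar .loc lookups, and downcast the float columns to float32 to halve the size of the float64 blocks the traceback shows being consolidated.
import numpy as np

# one vectorized pass over whole columns instead of ~6.5M scalar .loc lookups
EUR['GapPercent'] = np.where(
    EUR['Minutesdiff'] > 44,
    (EUR['Open-1'] / EUR['Close'] - 1) * 100,
    np.nan)

# optional: float32 halves the memory of each float block, at some precision cost
for col in ['Open', 'High', 'Low', 'Close', 'Volume', 'Open-1', 'Minutesdiff', 'GapPercent']:
    EUR[col] = EUR[col].astype('float32')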
Upvotes: 1