vks

Reputation: 67968

Pandas MemoryError while pd.concat

I am using pandas to read csv's.

df_from_each_file = (pd.read_csv(StringIO(f), compression='gzip', dtype=str) for f in all_files)
final_df = pd.concat(df_from_each_file, ignore_index=True)

The total number of rows across all_files is about 9,000,000, though each individual file is small.

When pd.concat runs, it fails with a MemoryError.

The system has 16 GB of RAM and 16 CPUs at 2 GHz each. Is the memory insufficient here? Is there anything else I can do to avoid the MemoryError?

I read about chunksize etc., but each file is small, so that should not be the problem. How can the concat be made free of MemoryError?

This is the traceback.

final_df = pd.concat(df_from_each_file, ignore_index=True)
File "/home/jenkins/fsroot/workspace/ric-dev-sim-2/VENV/lib/python2.7/site-packages/pandas/tools/merge.py", line 1326, in concat
return op.get_result()
File "/home/jenkins/fsroot/workspace/ric-dev-sim-2/VENV/lib/python2.7/site-packages/pandas/tools/merge.py", line 1517, in get_result
copy=self.copy)
File "/home/jenkins/fsroot/workspace/ric-dev-sim-2/VENV/lib/python2.7/site-packages/pandas/core/internals.py", line 4797, in concatenate_block_managers
placement=placement) for placement, join_units in concat_plan]
File "/home/jenkins/fsroot/workspace/ric-dev-sim-2/VENV/lib/python2.7/site-packages/pandas/core/internals.py", line 4902, in concatenate_join_units
concat_values = _concat._concat_compat(to_concat, axis=concat_axis)
File "/home/jenkins/fsroot/workspace/ric-dev-sim-2/VENV/lib/python2.7/site-packages/pandas/types/concat.py", line 165, in _concat_compat
return np.concatenate(to_concat, axis=axis)
MemoryError

df.info for 1 file is

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12516 entries, 0 to 12515
Columns: 322 entries, #RIC to Reuters Classification Scheme.1
dtypes: object(322)
memory usage: 30.7+ MB

Upvotes: 3

Views: 6678

Answers (1)

MaxU - stand with Ukraine

Reputation: 210812

First of all, don't use the dtype=str parameter unless you really need it.
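
For example, here is a quick sketch of how to measure the difference (data.csv.gz stands in for one of your files; the real names come from all_files):

import pandas as pd

# read the same gzipped CSV twice: once forcing every column to object/str,
# once letting pandas infer numeric dtypes where it can
df_str = pd.read_csv('data.csv.gz', compression='gzip', dtype=str)
df_auto = pd.read_csv('data.csv.gz', compression='gzip')

# deep=True counts the actual size of the Python string objects
print(df_str.memory_usage(deep=True).sum())
print(df_auto.memory_usage(deep=True).sum())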

Looking at your next question, you would need at least 2 * 90 GB = 180 GB of RAM for 9M rows (90 GB for the resulting DF plus 90 GB for the list of DFs that you are concatenating) if you use this approach:

Calculation (17.1 GB / 1713078 rows * 9*10**6 rows, expressed in GB):

In [18]: 17.1*1024**3/1713078*(9*10**6)/1024**3
Out[18]: 89.8382910760631

So you will have to process your data file by file and save it to something that can handle that amount of data - I would use either HDF or a database like MySQL / PostgreSQL / etc.:

import pandas as pd

fn = r'c:/tmp/test.h5'
store = pd.HDFStore(fn)

for f in all_file_names:
    x = pd.read_csv(f)
    # process `x` DF here
    # append this file's rows to the on-disk store instead of keeping them in RAM
    store.append('df_key', x, data_columns=[<list_of_indexed_columns>], complib='blosc', complevel=5)

store.close()
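
Once the data is in the store, you can read back just the rows you need instead of all 9M at once - for example (a sketch; some_indexed_column is a placeholder for one of the columns you passed to data_columns):

import pandas as pd

# `where` filtering only works on columns that were indexed via data_columns
subset = pd.read_hdf(r'c:/tmp/test.h5', 'df_key', where="some_indexed_column == 'some_value'")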

Upvotes: 2
