Reputation: 67968
I am using pandas to read CSVs.
df_from_each_file = (pd.read_csv(StringIO(f), compression='gzip', dtype=str) for f in all_files)
final_df = pd.concat(df_from_each_file, ignore_index=True)
The total number of rows across all_files is about 9,000,000 (9 million), though each individual file is small.
When pd.concat runs it fails with a MemoryError.
The system has 16 GB of RAM and 16 CPUs at 2 GHz each. Is the memory insufficient here? Is there anything else I can do to avoid the MemoryError?
I read about chunksize etc., but each file is small, so that should not be the problem. How can concat be made free of MemoryError?
This is the traceback.
final_df = pd.concat(df_from_each_file, ignore_index=True)
File "/home/jenkins/fsroot/workspace/ric-dev-sim-2/VENV/lib/python2.7/site-packages/pandas/tools/merge.py", line 1326, in concat
return op.get_result()
File "/home/jenkins/fsroot/workspace/ric-dev-sim-2/VENV/lib/python2.7/site-packages/pandas/tools/merge.py", line 1517, in get_result
copy=self.copy)
File "/home/jenkins/fsroot/workspace/ric-dev-sim-2/VENV/lib/python2.7/site-packages/pandas/core/internals.py", line 4797, in concatenate_block_managers
placement=placement) for placement, join_units in concat_plan]
File "/home/jenkins/fsroot/workspace/ric-dev-sim-2/VENV/lib/python2.7/site-packages/pandas/core/internals.py", line 4902, in concatenate_join_units
concat_values = _concat._concat_compat(to_concat, axis=concat_axis)
File "/home/jenkins/fsroot/workspace/ric-dev-sim-2/VENV/lib/python2.7/site-packages/pandas/types/concat.py", line 165, in _concat_compat
return np.concatenate(to_concat, axis=axis)
MemoryError
df.info() for one file is:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12516 entries, 0 to 12515
Columns: 322 entries, #RIC to Reuters Classification Scheme.1
dtypes: object(322)
memory usage: 30.7+ MB
None
Upvotes: 3
Views: 6678
Reputation: 210812
First of all, don't use the dtype=str parameter unless you really need it.
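For example, a rough comparison sketch (the file name here is just a placeholder) showing why str/object columns are so much heavier than inferred dtypes:

import pandas as pd

# with dtype=str every cell is a Python string object;
# letting pandas infer dtypes keeps numeric columns as compact fixed-width arrays
df_str = pd.read_csv('one_file.csv.gz', compression='gzip', dtype=str)
df_auto = pd.read_csv('one_file.csv.gz', compression='gzip')

print(df_str.memory_usage(deep=True).sum())   # all object columns
print(df_auto.memory_usage(deep=True).sum())  # inferred dtypes, usually much smaller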
Looking at your next question: if you use this approach, you would need at least 2 * 90 GB = 180 GB of RAM for 9M rows (90 GB for the resulting DF plus 90 GB for the list of DFs that you are concatenating):
Calculation: 17.1 GB / 1,713,078 rows * 9,000,000 rows, expressed in GB:
In [18]: 17.1*1024**3/1713078*(9*10**6)/1024**3
Out[18]: 89.8382910760631
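If you want to estimate this from your own data instead of my 17.1 GB figure, a quick sketch (assuming df is one of your files already loaded):

# bytes per row measured on one loaded file, scaled up to 9 million rows
bytes_per_row = float(df.memory_usage(deep=True).sum()) / len(df)
estimated_gb = bytes_per_row * 9 * 10**6 / 1024**3
print(estimated_gb)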
So you will have to process your data file by file and save it to something that can handle that amount of data - I would use either HDF or a database like MySQL / PostgreSQL / etc.:
import pandas as pd

fn = r'c:/tmp/test.h5'
store = pd.HDFStore(fn)

for f in all_file_names:
    df = pd.read_csv(f)
    # process `df` DF here
    store.append('df_key', df, data_columns=[<list_of_indexed_columns>],
                 complib='blosc', complevel=5)

store.close()
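Later you can read the data back selectively instead of loading all 9M rows at once - a small sketch (the column name in where is just a placeholder and must be one of your data_columns):

store = pd.HDFStore(fn)

# pull only the rows you need, filtered on an indexed column
subset = store.select('df_key', where="some_indexed_column == 'some_value'")

# or iterate over the whole table in manageable chunks
for chunk in store.select('df_key', chunksize=500000):
    # process each chunk here
    pass

store.close()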
Upvotes: 2