Alpha

Reputation: 2452

Too many open files in Windows when writing multiple HDF5 files

My question: how do I make sure HDF5 files are definitely closed after writing them?

I am trying to save data to HDF5 files - there are around 200 folders and each folder contains some data for each day for this year.

When I retrieve and save data using pandas HDFStore with the following code in the IPython console, the function stops by itself after a while, with no error message.

import pandas as pd

data = ... # in format as pd.DataFrame
# Method 1
data.to_hdf('D:/file_001/2016-01-01.h5', 'type_1')
# Method 2
with pd.HDFStore('D:/file_001/2016-01-01.h5', 'a') as hf:
    hf['type_1'] = data

When I ran the same script again to download the data, it failed with:

[Errno 24] Too many open files: ...

Some posts suggest raising the limit on Linux with, for example, ulimit -n 1200, but unfortunately I'm on Windows.

Besides, I think I am already closing the files explicitly by using a with block, especially in Method 2. Why does IPython still count these files as open?

My loop looks something like this:

univ = pd.read_excel(univ_file, univ_tab)
for dt in pd.date_range(start=start_date, end=end_date, freq='B'):
    for t in univ:
        data = download_data(t, dt)
        with pd.HDFStore(data_file, 'a') as hf:
            # Use pd.DataFrame([np.nan]) instead of pd.DataFrame() to save space
            hf[typ] = EMPTY_DF if data.shape[0] == 0 else data

Upvotes: 3

Views: 1087

Answers (1)

MaxU - stand with Ukraine

Reputation: 210842

You can list all files currently held open by the Python process on Windows using the psutil module. The demo assumes os and psutil have already been imported.

Demo:

In [52]: [proc.open_files() for proc in psutil.process_iter() if proc.pid == os.getpid()]
Out[52]:
[[popenfile(path='C:\\Windows\\System32\\en-US\\KernelBase.dll.mui', fd=-1),
  popenfile(path='C:\\Users\\Max\\.ipython\\profile_default\\history.sqlite-journal', fd=-1),
  popenfile(path='C:\\Users\\Max\\.ipython\\profile_default\\history.sqlite', fd=-1)]]

The file handle is closed as soon as the following with block finishes:

In [53]: with pd.HDFStore('d:/temp/1.h5', 'a') as hf:
   ....:     hf['df2'] = df
   ....:

Proof (the .h5 file does not appear in the list):

In [54]: [proc.open_files() for proc in psutil.process_iter() if proc.pid == os.getpid()]
Out[54]:
[[popenfile(path='C:\\Windows\\System32\\en-US\\KernelBase.dll.mui', fd=-1),
  popenfile(path='C:\\Users\\Max\\.ipython\\profile_default\\history.sqlite', fd=-1)]]

To check that psutil reports open files correctly at all, open a file by hand (note the D:\\temp\\aaa entry):

In [55]: fd = open('d:/temp/aaa', 'w')

In [56]: [proc.open_files() for proc in psutil.process_iter() if proc.pid == os.getpid()]
Out[56]:
[[popenfile(path='C:\\Windows\\System32\\en-US\\KernelBase.dll.mui', fd=-1),
  popenfile(path='D:\\temp\\aaa', fd=-1),
  popenfile(path='C:\\Users\\Max\\.ipython\\profile_default\\history.sqlite', fd=-1)]]

In [57]: fd.close()

In [58]: [proc.open_files() for proc in psutil.process_iter() if proc.pid == os.getpid()]
Out[58]:
[[popenfile(path='C:\\Windows\\System32\\en-US\\KernelBase.dll.mui', fd=-1),
  popenfile(path='C:\\Users\\Max\\.ipython\\profile_default\\history.sqlite', fd=-1)]]

Using this technique you can debug your code and find the place where the number of open files starts to grow in your case.
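A minimal sketch of how the open-file check could be wrapped around one step of a loop, assuming psutil is installed. The helper names open_file_paths and report_leaks are made up for illustration and are not part of psutil or pandas. The snapshot function is passed in as a parameter so it can be swapped out.

```python
import os


def open_file_paths():
    """Return the paths of files currently open in this Python process.

    Needs the third-party psutil package; it is imported lazily so that
    report_leaks() can also be used with a different snapshot function.
    """
    import psutil

    return [f.path for f in psutil.Process(os.getpid()).open_files()]


def report_leaks(step, snapshot=open_file_paths):
    """Run step() and return the files that were opened but not closed.

    `step` is any zero-argument callable, e.g. one loop iteration.
    `snapshot` must return an iterable of currently open file paths.
    """
    before = set(snapshot())
    step()
    after = set(snapshot())
    # Anything present afterwards but not before is still held open.
    return sorted(after - before)
```

In the loop from the question, each download-and-write iteration could be passed to report_leaks as a small function or lambda. The first iteration that returns a non-empty list points at the code that leaves a file open.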

Upvotes: 1
