user2146414

Reputation: 1038

Speed up reading multiple pickle files

I have a lot of pickle files. Currently I read them in a loop, but it takes a lot of time. I would like to speed it up but I don't have any idea how to do that.

Multiprocessing wouldn't work because, in order to transfer data from a child process to the main process, the data needs to be serialized (pickled) and deserialized.

Using threading wouldn't help either because of the GIL.

I think that the solution would be some library written in C that takes a list of files to read and then runs multiple threads (without the GIL). Is there something like this around?

UPDATE Answering your questions:

Upvotes: 18

Views: 6753

Answers (5)

hrokr

Reputation: 3559

I agree with what has been noted in the comments: due to the constraints of Python itself (chiefly the GIL, as you noted), there may simply be no faster way to load the information than what you are doing now. Or, if there is a way, it may be both highly technical and, in the end, only give you a modest increase in speed.

That said, depending on the datatypes you have, it may be faster to use quickle or pyrobuf.
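
For example, a minimal sketch with quickle (assuming your objects consist of the basic built-in types quickle supports, such as dicts, lists, strings and numbers) might look like this:

import quickle

# round-trip an object through quickle; the API mirrors pickle.dumps/loads,
# but only a restricted set of built-in types is supported
obj = {"name": "example", "values": [1, 2, 3]}

data = quickle.dumps(obj)       # serialize to bytes
restored = quickle.loads(data)  # deserialize back to a Python object
assert restored == obj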

Upvotes: 5

Chris_Rands

Reputation: 41168

Consider using HDF5 via h5py instead of pickle. The performance is generally much better than pickle for numerical data in pandas and NumPy data structures, and it supports most common data types as well as compression.
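
A rough sketch of what that might look like with h5py and NumPy (the file and dataset names here are just examples):

import h5py
import numpy as np

arr = np.random.rand(1000, 1000)

# write the array as a compressed HDF5 dataset
with h5py.File('data.h5', 'w') as f:
    f.create_dataset('arr', data=arr, compression='gzip')

# read it back; slicing only pulls the requested part from disk
with h5py.File('data.h5', 'r') as f:
    first_rows = f['arr'][:10]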

Upvotes: 0

SaGaR

Reputation: 542

I think you should try using mmap (memory-mapped files), which is similar to open() but can be much faster.

Note: if your files are large, use mmap; if they are small, use the regular methods.

I have written a sample that you can try.

import mmap
import pickle
from time import perf_counter as pf

def load_files(filelist):
    start = pf()  # for rough time calculations
    for filename in filelist:
        # open in binary mode; we only need the file descriptor for mmap
        with open(filename, mode="rb") as file_obj:
            with mmap.mmap(file_obj.fileno(), length=0, access=mmap.ACCESS_READ) as mmap_file_obj:
                data = pickle.load(mmap_file_obj)
                print(data)
    print(f'Operation took {pf()-start} sec(s)')

Here mmap.ACCESS_READ maps the file read-only. The file_obj returned by open is only used to obtain the file descriptor, which mmap needs to map the file into memory as a memory-mapped file. As you can see below in the documentation, os.open returns the file descriptor (or fd for short), so we don't have to do anything with file_obj operation-wise; we just need its fileno() method to get the descriptor. Also note that we are not closing file_obj before mmap_file_obj: the nested with blocks close the mmap first, as you said in your comment.

open (file, flags[, mode])
Open the file file and set various flags according to flags and possibly its mode according to mode. 
The default mode is 0777 (octal), and the current umask value is first masked out. 
Return the file descriptor for the newly opened file.
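
For instance, assuming your pickle files sit in the working directory (the glob pattern here is just an example), you could call the function above like this:

import glob

# collect the pickle file paths and time loading them
# with the load_files() function defined above
filelist = glob.glob('*.pickle')
load_files(filelist)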

Give it a try and see how much impact it has on your operation. You can read more about mmap and file descriptors in the Python documentation.

Upvotes: 2

Aaron

Reputation: 11075

I think that the solution would be some library written in C that takes a list of files to read and then runs multiple threads (without GIL). Is there something like this around?

In short: no. pickle is apparently good enough for enough people that there are no major alternate implementations fully compatible with the pickle protocol. As of sometime in Python 3, cPickle was merged into pickle, and neither releases the GIL anyway, which is why threading won't help you (search for Py_BEGIN_ALLOW_THREADS in _pickle.c and you will find nothing).

If your data can be re-structured into a simpler format like CSV, or a binary format like numpy's npy, there will be less CPU overhead when reading it. Pickle is built for flexibility first rather than for speed or compactness. One possible exception to the rule of "more complex means slower" is the HDF5 format via h5py, which can be fairly complex and which I have used to max out the bandwidth of a SATA SSD.
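
As a rough sketch (the file name is just an example), saving and loading a NumPy array in npy format looks like this:

import numpy as np

arr = np.arange(1_000_000, dtype=np.float64)

# npy stores the raw array buffer plus a small header, so loading it
# is mostly one big read with very little per-object CPU work
np.save('data.npy', arr)

loaded = np.load('data.npy')                  # read the whole array
mapped = np.load('data.npy', mmap_mode='r')   # or lazily memory-map it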

Finally, you mention you have many, many pickle files, and that in itself is probably causing no small amount of overhead: each time you open a new file, there's some overhead from the operating system. Conveniently, you can combine pickle files by simply appending them together, then call Unpickler.load() until you reach the end of the file. Here's a quick example of combining two pickle files using shutil:

import pickle, shutil, os

#some dummy data
d1 = {'a': 1, 'b': 2, 1: 'a', 2: 'b'}
d2 = {'c': 3, 'd': 4, 3: 'c', 4: 'd'}

#create two pickles
with open('test1.pickle', 'wb') as f:
    pickle.Pickler(f).dump(d1)
with open('test2.pickle', 'wb') as f:
    pickle.Pickler(f).dump(d2)
    
#combine list of pickle files
with open('test3.pickle', 'wb') as dst:
    for pickle_file in ['test1.pickle', 'test2.pickle']:
        with open(pickle_file, 'rb') as src:
            shutil.copyfileobj(src, dst)
            
#unpack the data
with open('test3.pickle', 'rb') as f:
    p = pickle.Unpickler(f)
    while True:
        try:
            print(p.load())
        except EOFError:
            break
        
#cleanup
os.remove('test1.pickle')
os.remove('test2.pickle')
os.remove('test3.pickle')

Upvotes: 5

You can try multiprocessing:

import os, pickle
pickle_list = os.listdir("pickles")

output_dict = dict.fromkeys(pickle_list, '')

def pickle_process_func(picklename):
    with open("pickles/" + picklename, 'rb') as file:
        dapickle = pickle.load(file)

    # if you need the previous file's output, wait for it
    while not output_dict[pickle_list[pickle_list.index(picklename) - 1]]:
        continue

    # then do something with the loaded data
    print("loaded")
    output_dict[picklename] = custom_func_i_dunno(dapickle)


from multiprocessing import Pool

with Pool(processes=10) as pool:
    pool.map(pickle_process_func, pickle_list)
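
Note that each worker process gets its own copy of output_dict, so writes made inside the workers are not visible to the main process or to each other. If you don't need that ordering logic, a simpler sketch (my own variation, not part of the answer above) is to let Pool.map return the loaded objects directly:

import os
import pickle
from multiprocessing import Pool

def load_one(picklename):
    # load a single pickle file and return the object to the parent process
    with open(os.path.join("pickles", picklename), 'rb') as f:
        return pickle.load(f)

if __name__ == '__main__':
    pickle_list = os.listdir("pickles")
    with Pool(processes=10) as pool:
        # results come back in the same order as pickle_list
        results = pool.map(load_one, pickle_list)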

Upvotes: 1
