Reputation: 479
I am trying to implement a bounded-buffer solution where the data generator and the model work as two separate processes. The data generator preprocesses the data and stores it in a shared queue (with a predefined max size to limit memory usage). The model, on the other hand, consumes data from this queue at its own pace until the queue is empty. Below is a snippet of my implementation.
'''
self._buffer is an instance of multiprocessing.Queue
'''
def produce(self):
    for obj in self._generator:
        self._buffer.put(obj, block=True, timeout=None)
    # Sentinel value tells the consumer there is no more data
    self._buffer.put(None)

def consume(self):
    while True:
        dat = self._buffer.get(block=True, timeout=None)
        if dat is None:
            break
        # Train model on `dat`

def run(self):
    pt = multiprocessing.Process(target=self.produce)
    ct = multiprocessing.Process(target=self.consume)
    pt.start()
    ct.start()
    pt.join()
    ct.join()
However, the solution above does not work. I used torch.multiprocessing as instructed by the documentation. I also set torch.multiprocessing.set_start_method('spawn') in order to avoid "RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method".
But now I get "TypeError: cannot pickle 'generator' object". How can this be fixed?
Upvotes: 0
Views: 856
Reputation: 2117
Since you work with PyTorch, you should use the Dataset and DataLoader approach. It handles all the problems with multiprocessing, shared memory, and so on for you.
You can have map-style or iterable-style datasets. It is best to read the official documentation on what is what and how they work.
In your case you are probably fine with an iterable-style dataset, as sketched below. I have used both approaches for similar cases. The iterable-style dataset is what you might need if you don't know how many samples you will be processing. In other cases I had a map-style dataset, where I knew the total number of samples beforehand (e.g. processing all images in a directory) and could use a sequential sampler to give me the elements in order.
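As a rough illustration, here is a minimal sketch of an iterable-style dataset wrapping a generator, consumed through a DataLoader. The names `GeneratorDataset` and `my_generator` are placeholders for whatever your preprocessing pipeline actually is:

```python
import torch
from torch.utils.data import IterableDataset, DataLoader

class GeneratorDataset(IterableDataset):
    """Iterable-style dataset that pulls samples from a generator factory.

    A factory (a callable returning a fresh generator) is stored instead
    of the generator itself, so nothing unpicklable has to cross the
    process boundary when DataLoader workers are started.
    """
    def __init__(self, generator_fn):
        self.generator_fn = generator_fn

    def __iter__(self):
        # Each worker process calls __iter__ and builds its own generator
        return self.generator_fn()

def my_generator():
    # Placeholder for your preprocessing; yields one sample at a time
    for i in range(100):
        yield torch.tensor([float(i)])

dataset = GeneratorDataset(my_generator)
# num_workers=1 runs the generator in one background process; the
# DataLoader's bounded prefetching gives you the producer/consumer
# behavior you were building by hand. (With several workers, each one
# would iterate the full generator and duplicate the data.)
loader = DataLoader(dataset, batch_size=8, num_workers=1)

for batch in loader:
    pass  # Train model on `batch`
```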
Regarding one of your problems: all errors like TypeError: cannot pickle 'generator' object
happen when you have objects which can't be serialized. pickle
is used for the serialization. In your case the message points at self._generator: Python generators can never be pickled, because their execution state (the suspended stack frame) cannot be serialized. I had similar cases where wrapped C++ packages created with pybind produced objects that were not serializable, or where I had some mutex
variables somewhere.
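A quick way to see this for yourself (a minimal sketch; any generator will do):

```python
import pickle

def counter():
    # The simplest possible generator
    yield 1

g = counter()
try:
    pickle.dumps(g)
except TypeError as e:
    print(e)  # prints: cannot pickle 'generator' object
```

A common workaround is to pass a function that creates the generator (as in the dataset sketch above) rather than the generator instance itself, so each process builds its own generator locally and nothing unpicklable is ever sent between processes.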
Upvotes: 1