Sujit

Reputation: 2441

Multiple threads reading from single folder on Linux

My project needs multiple threads reading files from the same folder. This folder receives incoming files, and each file should be processed by exactly one of those threads. The thread that reads a file then deletes it after processing.

EDIT after the first answer: I don't want a single thread in charge of reading filenames and feeding those names to the other threads so that they can read the files.

Is there an efficient way of achieving this in Python?

Upvotes: 0

Views: 1019

Answers (2)

mac

Reputation: 43031

You should probably use the Queue module (renamed to queue in Python 3). From the docs:

The Queue module implements multi-producer, multi-consumer queues. It is especially useful in threaded programming when information must be exchanged safely between multiple threads.

I would use a FIFO approach, with one thread in charge of checking for inbound files and queuing them, and a number of workers processing them. The module also supports a LIFO queue and a priority queue, where you assign the priority yourself.
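A minimal sketch of that FIFO arrangement, assuming Python 3 (where the module is named queue) and a folder that contains only regular files. The names process, worker and run_once are hypothetical, and process is just a placeholder for the real per-file work:

```python
import os
import queue
import threading


def process(path):
    # Hypothetical placeholder for the real work: here, just the file size.
    with open(path, "rb") as f:
        return len(f.read())


def worker(q, results):
    while True:
        path = q.get()
        try:
            if path is None:          # sentinel: no more work for this worker
                return
            results.append((path, process(path)))
            os.remove(path)           # the worker that processed the file deletes it
        except OSError:
            pass                      # e.g. the file vanished; skip it in this sketch
        finally:
            q.task_done()


def run_once(folder, num_workers=4):
    q = queue.Queue()                 # FIFO by default
    results = []
    threads = [threading.Thread(target=worker, args=(q, results))
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    # The calling thread plays the role of the single "dispatcher".
    for name in sorted(os.listdir(folder)):
        q.put(os.path.join(folder, name))
    for _ in threads:
        q.put(None)                   # one sentinel per worker
    for t in threads:
        t.join()
    return results
```

Because each name is handed out by q.get() exactly once, no two workers ever see the same file. A long-running service would poll the folder in a loop instead of listing it once.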


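EDIT: If you don't want to use the Queue module and you are on a *nix system, you could use fcntl.lockf instead. Alternatively, you could open the files with os.open('filename', os.O_EXLOCK), although that flag comes from BSD and is not defined on Linux. Note that fcntl.lockf takes POSIX record locks, which belong to the process: they keep other processes out, but not other threads of the same process. For threads on Linux, fcntl.flock on a descriptor that each thread opens separately is the variant that actually excludes them.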

Depending on how often you perform this operation, you might find it slower than using Queue, since you will have to account for race conditions (e.g. you might obtain the name of a file to open, but another thread might lock the file before you get a chance to open it, raising an exception that you will have to trap). Queue is there for a reason! ;)
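A rough sketch of the lock-based variant without a dispatcher thread, assuming Linux, Python 3, and a folder that contains only regular files. It uses fcntl.flock rather than fcntl.lockf, because on Linux flock locks are tied to the open file description, so two threads that each open the file themselves do exclude one another. The names claim_and_process, worker and handle are hypothetical:

```python
import fcntl
import os
import threading


def claim_and_process(path, handle):
    """Return True if the calling thread won the file and processed it."""
    try:
        fd = os.open(path, os.O_RDONLY)
    except FileNotFoundError:
        return False                  # another thread already deleted it
    try:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            return False              # another thread holds the lock right now
        if os.fstat(fd).st_nlink == 0:
            return False              # deleted between our open and our lock
        with open(fd, "rb", closefd=False) as f:
            handle(path, f.read())
        os.remove(path)
        return True
    finally:
        os.close(fd)                  # closing the descriptor releases the lock


def worker(folder, handle):
    while True:
        names = os.listdir(folder)
        if not names:
            break                     # sketch only: a service would sleep and poll again
        for name in names:
            claim_and_process(os.path.join(folder, name), handle)
```

The three early returns are exactly the race conditions mentioned above: the file is already gone, it is locked by someone else, or it was deleted while this thread was waiting to lock it. Every thread also lists the folder itself, so the directory gets scanned many times over.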


EDIT2: Comments on this and other questions bring up the problem of simultaneous disk access to different files and the resulting performance hit. I was thinking that task_done could be used to prevent this, but reading others' comments it occurred to me that instead of queuing file names, one could queue the files' content directly. This second alternative would work only for a limited number of files of limited size, given that RAM would fill up rather quickly otherwise.
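One way to keep the RAM usage of that second alternative in check is a bounded queue: queue.Queue(maxsize=...) makes put() block while the queue is full, so the reader never gets more than a few files ahead of the workers. A sketch, assuming Python 3 and a folder of regular files; reader, consumer, run_once, handle and max_buffered are hypothetical names:

```python
import os
import queue
import threading


def reader(folder, q, num_workers):
    # A single thread touches the disk, so files are read one after another.
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        with open(path, "rb") as f:
            q.put((path, f.read()))   # blocks while the queue is full
    for _ in range(num_workers):
        q.put(None)                   # one sentinel per consumer


def consumer(q, handle):
    while True:
        item = q.get()
        if item is None:
            break
        path, data = item
        try:
            handle(path, data)
        except Exception:
            continue                  # leave the file in place for a later retry
        os.remove(path)


def run_once(folder, handle, num_workers=4, max_buffered=8):
    q = queue.Queue(maxsize=max_buffered)   # at most max_buffered files held in RAM
    threads = [threading.Thread(target=consumer, args=(q, handle))
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    reader(folder, q, num_workers)    # the calling thread acts as the reader
    for t in threads:
        t.join()
```

Here max_buffered caps the number of queued files, not bytes, so it only bounds memory if individual files are reasonably small.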

I don't know whether RAID and other parallel disk configurations already take care of reading one file per disk rather than bouncing back and forth between two files on both disks.

HTH!

Upvotes: 2

Tudor

Reputation: 62439

If you want multiple threads to read several files in parallel directly from the same folder, then I must disappoint you. Reading in parallel from a single disk is not a viable option. A single spinning disk has to seek to the next location before it can read it. If you read with multiple threads, you just make the disk head bounce between files, and the performance is much worse than with a simple sequential read.

Just stick to mac's advice and use a single thread for reading.

Upvotes: 1
