VIjay

Reputation: 117

Multithreading in Apache Beam: Reading Files in Separate Threads

We have a requirement to create separate threads for reading multiple files.

  1. Thread 1 reads file 1 and creates a PCollection<String>. Can I run a ParDo operation in a multithreaded environment to create a PCollection<KV<String, String>> from the PCollection<String>?
  2. Thread 2 performs the same operation as Thread 1, but on a different file (file 2).
  3. The main thread joins the outputs of file 1 and file 2 after Thread 1 and Thread 2 have finished.

Could you please tell me whether this is possible and whether it is a recommended approach?

Upvotes: 3

Views: 2665

Answers (1)

Pablo

Reputation: 11031

It sounds like what you want can be done with Beam. In the Beam model, you do not define how your operations should run, only which operations you want to perform; Beam and the underlying runner then take care of managing threads.

That's why you generally shouldn't manage your own threads to read files in Beam. Use TextIO to read plain text files; TextIO should read the files in parallel for you.
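As a rough sketch of what that could look like for the three steps in the question (not a definitive implementation): each file is read with TextIO, a ParDo turns every line into a KV<String, String>, and CoGroupByKey joins the two keyed collections. The file paths, the class name TwoFileJoin, and the comma-based key rule in toKeyValue are assumptions made up for illustration; the question does not say how the keys are derived.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.join.CoGbkResult;
import org.apache.beam.sdk.transforms.join.CoGroupByKey;
import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TupleTag;

public class TwoFileJoin {

  // Assumption: the text before the first comma is the join key.
  static KV<String, String> toKeyValue(String line) {
    int comma = line.indexOf(',');
    return comma < 0
        ? KV.of(line, "")
        : KV.of(line.substring(0, comma), line.substring(comma + 1));
  }

  // The ParDo step: PCollection<String> -> PCollection<KV<String, String>>.
  static class ToKeyValueFn extends DoFn<String, KV<String, String>> {
    @ProcessElement
    public void processElement(
        @Element String line, OutputReceiver<KV<String, String>> out) {
      out.output(toKeyValue(line));
    }
  }

  // Reads one file and keys its lines; the runner decides how to parallelize.
  static PCollection<KV<String, String>> readKeyed(Pipeline p, String name, String path) {
    return p.apply("Read " + name, TextIO.read().from(path))
        .apply("Key " + name, ParDo.of(new ToKeyValueFn()));
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    TupleTag<String> file1Tag = new TupleTag<>();
    TupleTag<String> file2Tag = new TupleTag<>();

    // Hypothetical paths; replace with your own files or file patterns.
    PCollection<KV<String, CoGbkResult>> joined =
        KeyedPCollectionTuple.of(file1Tag, readKeyed(p, "file1", "/path/to/file1.txt"))
            .and(file2Tag, readKeyed(p, "file2", "/path/to/file2.txt"))
            .apply("Join", CoGroupByKey.create());

    // Further transforms on `joined` would go here, e.g. reading
    // result.getAll(file1Tag) and result.getAll(file2Tag) per key.

    p.run().waitUntilFinish();
  }
}
```

Note that nothing here creates a thread or waits for one. The two reads are independent branches of the pipeline graph, and the join only consumes their outputs, so the runner is free to schedule them concurrently. Running it requires the Beam Java SDK and a runner (for example the direct runner) on the classpath.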

There are a few cases in which your files cannot be read in parallel:

  1. Your files are compressed. A compressed file has to be decompressed as it is read, so it cannot be read from different offsets simultaneously.
  2. You have too many files (thousands or tens of thousands). In that case you may want to use TextIO.readAll instead of the normal TextIO read, because keeping track of thousands of files that are being read in parallel can overwhelm the system.
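To illustrate the first case without Beam, here is a small plain-Java sketch using only the standard library. It is a simplified illustration of the idea, not how TextIO is implemented: an uncompressed text buffer can be cut at an arbitrary byte offset and each half read independently by skipping to the next line boundary, whereas a gzip stream handed to a reader from some offset other than 0 cannot be decoded. The class and method names (SplitDemo, readRange, canDecompressFrom) are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class SplitDemo {

  // Returns the lines that START inside [start, end) of an uncompressed buffer.
  static List<String> readRange(byte[] text, int start, int end) {
    int pos = start;
    if (start > 0) {
      // Skip to just past the first newline at or after start - 1, so that a
      // line cut by the range boundary is left to the previous range.
      pos = start - 1;
      while (pos < text.length && text[pos] != '\n') pos++;
      pos++;
    }
    List<String> lines = new ArrayList<>();
    while (pos < end && pos < text.length) {
      int eol = pos;
      while (eol < text.length && text[eol] != '\n') eol++;
      lines.add(new String(text, pos, eol - pos, StandardCharsets.UTF_8));
      pos = eol + 1;
    }
    return lines;
  }

  static byte[] gzip(byte[] plain) {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
      gz.write(plain);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return buf.toByteArray();
  }

  // True if a reader that only sees the bytes from `offset` on can decode them.
  static boolean canDecompressFrom(byte[] compressed, int offset) {
    byte[] tail = Arrays.copyOfRange(compressed, offset, compressed.length);
    byte[] scratch = new byte[1024];
    try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(tail))) {
      while (gz.read(scratch) != -1) {
        // Drain the stream; only success or failure matters here.
      }
      return true;
    } catch (IOException e) {
      return false; // missing header or corrupt data
    }
  }

  public static void main(String[] args) {
    byte[] text = "alpha\nbeta\ngamma\ndelta\n".getBytes(StandardCharsets.UTF_8);
    int mid = text.length / 2;

    // Two independent "workers" cover the file with no gaps or duplicates.
    System.out.println(readRange(text, 0, mid));            // [alpha, beta]
    System.out.println(readRange(text, mid, text.length));  // [gamma, delta]

    byte[] compressed = gzip(text);
    System.out.println(canDecompressFrom(compressed, 0));   // true
    // Starting anywhere else is expected to fail.
    System.out.println(canDecompressFrom(compressed, compressed.length / 2));
  }
}
```

This is why a single gzip file ends up being read by one worker from start to finish, while a plain text file of the same size can be divided among several.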

Let me know if you are using non-plain-text files or some other kind of source.

Upvotes: 2
