Can I force a step in my dataflow pipeline to be single-threaded (and on a single machine)?

Question

I have a pipeline that takes URLs for files and downloads these generating BigQuery table rows for each line apart from the header.

To avoid duplicate downloads, I want to check URLs against a table of previously downloaded ones and only go ahead and store the URL if it is not already in this "history" table.

For this to work I need to either store the history in a database allowing unique values or it might be easier to use BigQuery for this also, but then access to the table must be strictly serial.

Can I enforce single-thread execution (on a single machine) to satisfy this for part of my pipeline only?

(After this point, each of my 100s of URLs/files would be suitable for processed on a separate thread; each single file gives rise to 10000-10000000 rows, so throttling at that point will almost certainly not give performance issues.)

Andrew Pilloud · Accepted Answer

Beam is designed for parallel processing of data and it tries to explicitly stop you from synchronizing or blocking except using a few built-in primitives, such as Combine.

It sounds like what you want is a filter that emits an element (your URL) only the first time it is seen. You can probably use the built-in Distinct transform for this. This operator uses a Combine per-key to group the elements by key (your URL in this case), then emits each key only the first time it is seen.

Can I force a step in my dataflow pipeline to be single-threaded (and on a single machine)?

Answers (1)

Related Questions