Reputation: 17
Intention: I have a bounded data set (~10k records, exposed via a REST API) that needs to be written to BigQuery with some additional processing steps. Since the data set is bounded, I've implemented the BoundedSource interface to read the records in my Apache Beam pipeline.
Problem: all 10k records are read in one shot (and written to BigQuery in one shot as well). Instead, I want to query a small part (for example 200 rows), process it, save it to BigQuery, and only then query the next 200 rows.
I've found that I can use windowing with bounded PCollections, but those windows are created on a time basis (every 10 seconds, for example), whereas I want them on a record-count basis (every 200 records). A sketch of what I tried is below.
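Roughly what I tried (a sketch; MyRow stands in for my record type):

```java
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// Fixed windows are defined by time (here: 10 seconds) --
// I found no way to say "every 200 records" with this approach.
PCollection<MyRow> windowed =
    rows.apply(Window.<MyRow>into(FixedWindows.of(Duration.standardSeconds(10))));
```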
Question: How can I implement the described splitting into batches/windows of 200 records? Am I missing something?
My question is similar to this one, but it wasn't answered.
Upvotes: 0
Views: 742
Reputation: 5104
Given a PCollection of rows, you can use GroupIntoBatches to batch these up into a PCollection of sets of rows of a given size.
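A minimal sketch, assuming the Java SDK and a placeholder record type MyRow (GroupIntoBatches operates on key/value pairs, so attach a key first):

```java
import org.apache.beam.sdk.transforms.GroupIntoBatches;
import org.apache.beam.sdk.transforms.WithKeys;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

// rows is the PCollection<MyRow> produced by your BoundedSource read.
// A single constant key is the simplest choice, though it funnels this
// grouping step through one key (limiting its parallelism).
PCollection<KV<Integer, MyRow>> keyed =
    rows.apply("AttachKey", WithKeys.<Integer, MyRow>of(0));

// Emit groups of up to 200 rows each.
PCollection<KV<Integer, Iterable<MyRow>>> batches =
    keyed.apply("BatchInto200", GroupIntoBatches.<Integer, MyRow>ofSize(200));

// Each element of `batches` can then be processed and written to BigQuery
// by a downstream DoFn.
```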
As for reading your input in an incremental way, you can use the split method of BoundedSource to shard your read into several pieces which will then be read independently (possibly on separate workers). For a bounded pipeline, this will still happen in its entirety (all 10k records read) before anything is written, but need not happen on a single worker. You could also insert a Reshuffle to decouple the parallelism between your read and your write.
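As a rough sketch of split, assuming a hypothetical MyRestSource that knows its total record count (totalRecords) and can read an arbitrary range [start, end):

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.io.BoundedSource;
import org.apache.beam.sdk.options.PipelineOptions;

// Inside the hypothetical MyRestSource extends BoundedSource<MyRow>:
// return several 200-record sub-sources so the runner can read them
// independently (possibly on separate workers).
@Override
public List<? extends BoundedSource<MyRow>> split(
    long desiredBundleSizeBytes, PipelineOptions options) throws Exception {
  List<MyRestSource> shards = new ArrayList<>();
  for (long start = 0; start < totalRecords; start += 200) {
    shards.add(new MyRestSource(start, Math.min(start + 200, totalRecords)));
  }
  return shards;
}
```

For the Reshuffle, something like rows.apply(Reshuffle.viaRandomKey()) between the read and the write breaks fusion between the two stages, so the write's parallelism isn't tied to the read's.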
Upvotes: 1