Apache Beam dataflow combine per key

Question

I have a problem with my pipeline. My goal is a read around 4k parquet files read it as a numpy array and then make some aggregations eg from one file can make 100 keys each key has some numbers of data. Then I have combine per key logic and my goal is reduce each file and for each key and get one value. On smaller dataset it works good but when I ran it with a bigger dataset I am getting two kind of issues one is OOM and second is a hot heys. I think problem is that each key have same number of data but Matrix for some keys is much bigger than other.

I tried with hot key fan-out is a bit better but problem still occur. Do you know how to find correct value for fanout.

Apache Beam dataflow combine per key

Answers (0)

Related Questions