Dilip Sharma

Reputation: 71

How to handle large in-memory data in Apache Beam Pipeline to run on Google Dataflow Runner

I have the following simple code. The variable word_to_id takes ~50MB in memory. Because it is captured in the closure of extract_word_ids, it gets serialized into the job creation request, so submitting the pipeline to the Dataflow Runner fails with:

413 Request Entity Too Large

  word_to_id = {tok: idx for idx, tok in enumerate(vocab)}

  def extract_word_ids(tokens):
    # Membership test rather than a truthiness check, so the token with id 0 is kept.
    return [word_to_id[w] for w in tokens if w in word_to_id]

  with beam.Pipeline(options=get_pipeline_option()) as p:
    lines = p | 'Read' >> beam.io.ReadFromText(path)

    word_ids = (
        lines
        | 'TokenizeLines' >> beam.Map(words)
        | 'IntegerizeTokens' >> beam.Map(extract_word_ids)
    )

Please suggest an alternative way to handle this.

Upvotes: 1

Views: 1106

Answers (2)

Dilip Sharma

Reputation: 71

Finally, I managed to solve it and it works. I used DoFn.setup to initialize my variable from the GCS bucket.

import apache_beam as beam
import tensorflow as tf


class IntegerizeTokens(beam.DoFn):
  """Beam line-processing DoFn."""

  def __init__(self, vocab_filename):
    self.vocab_filename = vocab_filename

  def setup(self):
    # Read the vocabulary (assumed one token per line) from the GCS bucket.
    with tf.io.gfile.GFile(tf.io.gfile.glob(self.vocab_filename + '*')[0], 'r') as fh:
      vocab = fh.read().splitlines()
    self.word_to_id = {tok: idx for idx, tok in enumerate(vocab)}
    print('Setup done!')

  def process(self, tokens):
    """Takes a list of tokens and yields the list of their vocabulary ids."""
    yield [self.word_to_id[w] for w in tokens if w in self.word_to_id]

Now pass the DoFn to ParDo:

  with beam.Pipeline(options=get_pipeline_option()) as p:
    lines = p | 'Read' >> beam.io.ReadFromText(path)

    word_ids = (
        lines
        | 'TokenizeLines' >> beam.Map(words)
        | 'IntegerizeTokens' >> beam.ParDo(IntegerizeTokens(vocab_temp_path))
    )

This is one way to solve it. I think DoFn.setup is a good place to initialize large in-memory variables: it runs once when the DoFn instance is initialized on a worker, so the data never has to travel with the job request.
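If you would rather not depend on TensorFlow just for file I/O, the same setup can be written with Beam's own FileSystems module. A minimal sketch, under the same assumption of one token per line:

import apache_beam as beam
from apache_beam.io.filesystems import FileSystems


class IntegerizeTokens(beam.DoFn):

  def __init__(self, vocab_filename):
    self.vocab_filename = vocab_filename

  def setup(self):
    # Resolve the glob and take the first match, as in the GFile version.
    match = FileSystems.match([self.vocab_filename + '*'])[0]
    vocab_path = match.metadata_list[0].path
    # FileSystems.open returns a binary stream, so decode before splitting.
    with FileSystems.open(vocab_path) as fh:
      vocab = fh.read().decode('utf-8').splitlines()
    self.word_to_id = {tok: idx for idx, tok in enumerate(vocab)}

  def process(self, tokens):
    yield [self.word_to_id[w] for w in tokens if w in self.word_to_id]

FileSystems dispatches on the path scheme, so the same code works for local files and gs:// paths.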

Upvotes: 2

Iñigo

Reputation: 2670

You can use GCS buckets as sources for both the text and the variable, and pass the variable to your transform as a side input. A side input can be consumed as a list, a dict, or a singleton.
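In the Python SDK those views live in beam.pvalue. A tiny sketch (the Create inputs here are made up just for illustration):

import apache_beam as beam

with beam.Pipeline() as p:
  pairs = p | 'Pairs' >> beam.Create([('a', 0), ('b', 1)])
  main = p | 'Main' >> beam.Create(['a', 'b', 'c'])

  # AsList materializes the side input as a Python list,
  # AsDict expects (key, value) pairs, AsSingleton expects exactly one element.
  looked_up = main | 'Lookup' >> beam.Map(
      lambda x, d: d.get(x, -1), d=beam.pvalue.AsDict(pairs))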

Here is an example of a word count that removes stopwords, which are stored in a GCS bucket:

import re

import apache_beam as beam
from apache_beam import FlatMap
from apache_beam.io import ReadFromText, WriteToText
from apache_beam.transforms.combiners import Count

with beam.Pipeline() as p:

    path = "gs://dataflow-samples/shakespeare/kinglear.txt"
    stopwords_path = "<BUCKET/stopwords>"

    output_path = "<BUCKET>"

    def split_words(text, stopwords):
        # \W+ splits on runs of non-word characters; drop the empty strings
        # produced by leading/trailing separators.
        words = [w for w in re.split(r'\W+', text) if w]
        return [x for x in words if x.lower() not in stopwords]

    stopwords_p = (p | "Read Stop Words" >> ReadFromText(stopwords_path)
                     | FlatMap(lambda x: x.split(", ")))

    text = p | "Read Text" >> ReadFromText(path)

    (text | "Split Words" >> FlatMap(split_words, stopwords=beam.pvalue.AsList(stopwords_p))
          | "Count" >> Count.PerElement()
          | "Write" >> WriteToText(file_path_prefix=output_path, file_name_suffix=".txt"))

Upvotes: 2
