TKross

Reputation: 13

ModuleNotFoundError when running a GCP Dataflow pipeline with Python

I'm trying to install dependencies for a Dataflow pipeline. First I used the requirements_file flag, but I get (ModuleNotFoundError: No module named 'unidecode' [while running 'Map(wordcleanfn)-ptransform-54']); the only package added is unidecode. As a second option, I configured a Docker image following the Google documentation:

FROM apache/beam_python3.10_sdk:2.52.0

ENV RUN_PYTHON_SDK_IN_DEFAULT_ENVIRONMENT=1

RUN pip install unidecode

RUN apt-get update && apt-get install -y

ENTRYPOINT ["/opt/apache/beam/boot"]

I built the image on a VM in the GCP project and pushed it to Artifact Registry. Then I generated the pipeline template with:

python -m mytestcode \
    --project myprojectid \
    --region us-central1 \
    --temp_location gs://mybucket/beam_test/tmp/ \
    --runner DataflowRunner \
    --staging_location gs://mybucket/beam_test/stage_output/ \
    --template_name mytestcode_template \
    --customvariable 500 \
    --experiments use_runner_v2 \
    --sdk_container_image us-central1-docker.pkg.dev/myprojectid/myimagerepo/dataflowtest-image:0.0.1 \
    --sdk_location container

After all that, I created the job from the template through the UI, but the error is the same. Can someone please help me? I understand that the workers are using the default Beam SDK; is that correct? How can I fix it?

Upvotes: 0

Views: 162

Answers (1)

You can get this error when the module is imported only at the top of your main file. For example, suppose you are performing a unidecode operation inside a ParDo function. In that case, put the import statement inside the ParDo function instead of relying on the import at the top of the file, so the import is resolved on the worker when the function runs.

In my case, I imported the datetime library inside my ParDo Function:

import apache_beam as beam

class sample_function(beam.DoFn):
    def process(self, element):
        # Deferred import: resolved on the Dataflow worker at call time,
        # not on the machine that constructs the pipeline
        from datetime import datetime
        yield (element, datetime.now().isoformat())

Upvotes: 0
