Reputation: 13
I'm trying to install dependencies in a Dataflow pipeline. First I used the requirements_file flag, but I get (ModuleNotFoundError: No module named 'unidecode' [while running 'Map(wordcleanfn)-ptransform-54']); the only package added is unidecode (this first approach is sketched after the Dockerfile below). As a second option, I configured a Docker image following the Google documentation:
# Base image with the Beam 2.52.0 / Python 3.10 SDK preinstalled
FROM apache/beam_python3.10_sdk:2.52.0
ENV RUN_PYTHON_SDK_IN_DEFAULT_ENVIRONMENT=1
# Install the missing Python dependency into the image
RUN pip install unidecode
# Refresh the apt index (as written, no extra apt packages are installed)
RUN apt-get update && apt-get install -y
ENTRYPOINT ["/opt/apache/beam/boot"]
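For reference, the first (failing) approach passed the dependency with the requirements_file pipeline option, roughly like this. This is a minimal sketch; requirements.txt is a one-line file containing only unidecode:
python -m mytestcode \
--project myprojectid \
--region us-central1 \
--temp_location gs://mybucket/beam_test/tmp/ \
--runner DataflowRunner \
--requirements_file requirements.txt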
The image was built on a VM in the GCP project and pushed to Artifact Registry. Then I generated the pipeline template with:
python -m mytestcode \
--project myprojectid \
--region us-central1 \
--temp_location gs://mybucket/beam_test/tmp/ \
--runner DataflowRunner \
--staging_location gs://mybucket/beam_test/stage_output/ \
--template_name mytestcode_template \
--customvariable 500 \
--experiments use_runner_v2 \
--sdk_container_image us-central1-docker.pkg.dev/myprojectid/myimagerepo/dataflowtest-image:0.0.1 \
--sdk_location container
After all that, I created the job from the template through the UI, but the error is the same. Can someone please help me? I understand that the workers are using the default Beam SDK; is that correct? How can I fix it?
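For completeness, launching the same template from the CLI instead of the UI would look something like this. The template GCS path and job name here are assumptions, and customvariable is assumed to be exposed as a template parameter:
gcloud dataflow jobs run mytestjob \
--gcs-location gs://mybucket/beam_test/templates/mytestcode_template \
--region us-central1 \
--parameters customvariable=500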
Upvotes: 0
Views: 162
Reputation: 1
You will get this error if you import the module globally at the top of your code. For example, suppose you are performing a unidecode library operation inside a ParDo function: in that case, put the import statement inside the ParDo function instead of at the top of the file.
In my case, I imported the datetime library inside my ParDo function:
import apache_beam as beam

class sample_function(beam.DoFn):
    def process(self, element):
        from datetime import datetime  # imported here so it resolves on the worker
        yield element
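Applied to the question, the same idea would look roughly like this (a sketch: the body of wordcleanfn is assumed, since only its name appears in the error message):
def wordcleanfn(word):
    # Import inside the function so it resolves on the worker at call time
    from unidecode import unidecode
    return unidecode(word)
Note that this only helps when the package is already installed on the workers; if unidecode is missing from the worker environment entirely, it still has to be installed via requirements_file, setup_file, or a custom container image.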
Upvotes: 0