
Reputation: 664

Run pip install py4j in the Python Evaluator transform of GCP Data Fusion

I am trying to run "pip install py4j" for the Python Evaluator plugin in Native mode. I can't find where to run this command to install the dependency, and I haven't been able to find a solution anywhere on the web. Please guide me on how to execute this command in Data Fusion.

Thanks in advance!

Upvotes: 1

Views: 992

Answers (2)

Gavin YANG

Reputation: 1

Yes, Tlaquetzal is right. Basically, you have two ways to achieve this:

  1. Use a fixed (existing) cluster and set up the Remote Hadoop Provisioner in CDAP

  2. Create a custom image with the library.

    • Create a custom image with the library (see the Dataproc custom image documentation), using a customization script like the one below; a build-command sketch follows this list.
    #!/bin/bash
    # Install Python 3 and pip, then install py4j on the image
    apt-get update
    apt-get install -y python3.7
    apt-get install -y python3-pip
    pip3 install py4j
    
    • Set up the customized image in the CDAP compute profile
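A minimal sketch of how the customization script above might be turned into a custom Dataproc image with Google's custom-images tooling. The image name, script filename, Dataproc version, zone, and bucket below are placeholder assumptions, not values from the original answer:

    # Clone Google's Dataproc custom-images tooling and build an image whose
    # customization script (the one shown above, saved e.g. as install_py4j.sh)
    # installs Python 3 and py4j at image-build time.
    git clone https://github.com/GoogleCloudDataproc/custom-images.git
    cd custom-images
    python generate_custom_image.py \
        --image-name py4j-dataproc-image \
        --dataproc-version 1.3.46-debian9 \
        --customization-script ../install_py4j.sh \
        --zone us-central1-a \
        --gcs-bucket gs://my-staging-bucket

The resulting image can then be selected in the CDAP compute profile, as described in the last bullet above.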

Upvotes: 0

Tlaquetzal

Reputation: 2850

There's no straightforward approach for this, because you cannot modify, from within the pipeline, the Dataproc cluster used for execution. So, if you really need to use the Python Evaluator plugin in Native mode, my suggestion is to create a cluster that has the py4j library, and then connect it to Data Fusion using the "Remote Hadoop provisioner".

Consider that to use this provisioner, you'll need to create a new Compute Profile, which is only available in the Data Fusion Enterprise edition.

To install the py4j library in your cluster, you can either create a custom image with the library, provide an initialization actions script to install it, or SSH into the machines and manually execute the pip install command.
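As a hedged sketch of the initialization-actions route, assuming a bucket, cluster name, and region of your own (these placeholders are not part of the original answer):

    #!/bin/bash
    # install_py4j.sh: initialization action run on every node when the cluster is created
    pip install py4j || pip3 install py4j

    # Upload the script and create the long-lived cluster that the
    # Remote Hadoop provisioner will later connect to.
    gsutil cp install_py4j.sh gs://my-bucket/init/install_py4j.sh
    gcloud dataproc clusters create py4j-cluster \
        --region=us-central1 \
        --initialization-actions=gs://my-bucket/init/install_py4j.sh

Alternatively, on an already-running cluster you could SSH into each node and run pip install py4j by hand, but that change is lost if the cluster is recreated.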

Upvotes: 1
