Vince Gonzalez

Reputation: 380

Google Cloud Dataflow, TextIO and Kerberized HDFS

I am trying to use Beam Java 2.22.0 on the Dataflow runner to read TSV files from kerberized HDFS. I'm using a Dataproc cluster with the Kerberos component enabled to provide the kerberized HDFS. The error I get is:

Error message from worker: org.apache.hadoop.security.AccessControlException: SIMPLE authentication is not enabled. Available:[TOKEN, KERBEROS]

I am configuring the pipeline as follows (note that I've set java.security.krb5.realm and java.security.krb5.kdc, which I believe makes a krb5.conf unnecessary on the workers). My HdfsTextIOOptions extends HadoopFileSystemOptions, which lets me initialize the pipeline with my Hadoop config.
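
For reference, the options interface is shaped roughly like this (a sketch inferred from the getters used in the code below; the @Description texts and example values are illustrative):

  public interface HdfsTextIOOptions extends HadoopFileSystemOptions {
    @Description("GCS URI of the Kerberos keytab, e.g. gs://my-bucket/user.keytab")
    String getGcsKeytabPath();
    void setGcsKeytabPath(String value);

    @Description("Kerberos principal to log in as")
    String getUserPrincipal();
    void setUserPrincipal(String value);

    @Description("Input file pattern to read")
    String getInputFile();
    void setInputFile(String value);
  }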

I am obtaining a (currently unencrypted) keytab from a GCS location, and using that to initialize UserGroupInformation.

  public static void main(String[] args) throws IOException {
    System.setProperty("java.security.krb5.realm", "MY_REALM");
    System.setProperty("java.security.krb5.kdc", "my.kdc.hostname");

    HdfsTextIOOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(
            HdfsTextIOOptions.class);

    // Fetch the keytab from GCS and stage it locally for the keytab login.
    Storage storage = StorageOptions.getDefaultInstance().getService();
    URI uri = URI.create(options.getGcsKeytabPath());
    System.err.println(String
        .format("URI: %s, filesystem: %s, bucket: %s, filename: %s", uri.toString(),
            uri.getScheme(), uri.getAuthority(),
            uri.getPath()));
    Blob keytabBlob = storage.get(BlobId.of(uri.getAuthority(),
        uri.getPath().startsWith("/") ? uri.getPath().substring(1) : uri.getPath()));
    Path localKeytabPath = Paths.get("/tmp", uri.getPath());
    System.err.println(localKeytabPath);

    keytabBlob.downloadTo(localKeytabPath);

    // Point the Hadoop client at the cluster and enable Kerberos auth.
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:8020");
    conf.set("hadoop.security.authentication", "kerberos");

    // UGI must be given the Kerberos configuration before the keytab login.
    UserGroupInformation.setConfiguration(conf);
    UserGroupInformation
        .loginUserFromKeytab(options.getUserPrincipal(), localKeytabPath.toString());

    options.setHdfsConfiguration(ImmutableList.of(conf));

    Pipeline p = Pipeline.create(options);

    p.apply(TextIO.read().from(options.getInputFile()))
    ...

Am I missing some essential bit of configuration to properly access kerberized HDFS from Beam on Dataflow?

Thanks!

Upvotes: 1

Views: 476

Answers (1)

Lukasz Cwik

Reputation: 1731

It looks like you are setting the system properties only at pipeline construction time. You'll need to make sure these properties are also set during pipeline execution on the workers.

A simple way to do this is to write your own JvmInitializer which sets these properties. The worker will instantiate your JvmInitializer using Java's ServiceLoader.
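
For example, a minimal sketch (the class name is arbitrary; registration here uses Google's @AutoService to generate the ServiceLoader entry, but a hand-written META-INF/services file works too):

  import com.google.auto.service.AutoService;
  import org.apache.beam.sdk.harness.JvmInitializer;
  import org.apache.beam.sdk.options.PipelineOptions;

  @AutoService(JvmInitializer.class)
  public class KerberosJvmInitializer implements JvmInitializer {
    @Override
    public void beforeProcessing(PipelineOptions options) {
      // Runs on each worker JVM before pipeline processing begins, so the
      // Kerberos properties are present at execution time, not just at
      // pipeline construction time.
      System.setProperty("java.security.krb5.realm", "MY_REALM");
      System.setProperty("java.security.krb5.kdc", "my.kdc.hostname");
    }
  }

With that class on the worker classpath, the properties are set before any of your HDFS reads run.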

Upvotes: 1
