vy32

Reputation: 29687

Set S3 object metadata (tag) when writing RDD to S3 with Spark

Is there a way to set an S3 Object's metadata (I want to set the tag) when writing an RDD to S3 from Spark? The examples that I find (such as Amazon's and Spark set S3 object metadata while writing to EMRFS) are for DataFrames, not RDDs.

Upvotes: 2

Views: 2142

Answers (2)

vadym_naumenko

Reputation: 9

This solution is a hack and can hardly be used in production, but it works for Dataset and should, in theory, work for RDDs as well.

  1. Go to your EMR cluster master node and download /lib/spark/jars/emrfs-hadoop-assembly-2.50.0.jar
  2. Add that lib to the /libs directory of your Spark application
  3. Change build.gradle to:
plugins {
    id 'java'
    id 'com.github.johnrengelman.shadow' version '8.1.1'
}

apply plugin: 'com.github.johnrengelman.shadow'

jar.enabled = false

version = '0.0.DEBUG'

sourceCompatibility = 1.8

repositories {
    maven {
        url 'https://s3.us-east-1.amazonaws.com/us-east-1-emr-artifacts/emr-6.6.0/repos/maven/'
    }
    mavenCentral()
}

ext['aws_version'] = '1.12.170'
// For emr spark and hadoop versions please refer https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-650-release.html
//https://s3.console.aws.amazon.com/s3/buckets/us-east-1-emr-artifacts?prefix=emr-6.6.0/repos/maven/org/apache/hadoop/&region=us-east-1
ext['hadoop_version'] = '3.2.1-amzn-6'
ext['spark_version'] = '3.2.0'

task unzipEmrHadoopAssembly(type: Copy) {
    from zipTree('libs/emrfs-hadoop-assembly-2.50.0.jar')
    into("$buildDir/libs/emrfs-hadoop-assembly")
    include "**/*"
    exclude "**/S3ObjectRequestFactory.class"
}

dependencies {
    shadow("org.apache.spark:spark-core_2.12:$spark_version")
    shadow("org.apache.spark:spark-sql_2.12:$spark_version")

    shadow("org.apache.hadoop:hadoop-aws:$hadoop_version")
    shadow("org.apache.hadoop:hadoop-common:$hadoop_version")
    shadow("org.apache.hadoop:hadoop-client:$hadoop_version")

    //lib from the cluster from /lib/spark/jars/emrfs-hadoop-assembly-2.50.0.jar
    implementation files("$buildDir/libs/emrfs-hadoop-assembly") {
        builtBy 'unzipEmrHadoopAssembly'
    }

    testImplementation('junit:junit:4.12')
}

assemble.dependsOn shadowJar

shadowJar {
    manifest {
        attributes(
                'Main-Class': 'com.mycompany.Runner'
        )
    }
    relocate("com.amazon.ws.emr.hadoop.fs", "com.amazon.mycompany.emr.hadoop.fs")
    archiveFileName = "${project.name}-${project.version}.jar"
}
  4. Create the class com.amazon.ws.emr.hadoop.fs.s3.S3ObjectRequestFactory and change the methods newPutObjectRequest, newCopyObjectRequest, and newMultipartUploadRequest to set tags on your S3 objects.
  5. Use sparkConf.set("spark.hadoop.fs.s3.impl", "com.amazon.mycompany.emr.hadoop.fs.EmrFileSystem") when you need to create S3 files with tags.
  6. When you run your job, use --jars to include your new Hadoop file system on the classpath. Better yet, create a separate jar that is a full copy of emrfs-hadoop-assembly-2.50.0.jar but sets a tag on your S3 objects.
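For anyone patching the factory methods in step 4: with the AWS SDK v1 that EMRFS bundles, tags can be attached to a PutObjectRequest via withTagging(new ObjectTagging(...)). Under the hood, S3 carries write-time tags in the x-amz-tagging request header as a URL-encoded query string. A minimal, SDK-free sketch of that encoding (class and method names here are illustrative, not part of EMRFS):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class S3TagHeader {
    // S3 expects write-time tags in the x-amz-tagging header as a
    // URL-encoded query string, e.g. "team=data&env=prod".
    public static String encodeTagSet(Map<String, String> tags) {
        StringJoiner joined = new StringJoiner("&");
        tags.forEach((key, value) -> {
            try {
                joined.add(URLEncoder.encode(key, "UTF-8") + "="
                        + URLEncoder.encode(value, "UTF-8"));
            } catch (UnsupportedEncodingException e) {
                throw new IllegalStateException(e); // UTF-8 always exists
            }
        });
        return joined.toString();
    }

    public static void main(String[] args) {
        Map<String, String> tags = new LinkedHashMap<>();
        tags.put("team", "data platform");
        tags.put("env", "prod");
        System.out.println(encodeTagSet(tags));
        // prints: team=data+platform&env=prod
    }
}
```

A patched newPutObjectRequest would build an equivalent tag set (via the SDK's ObjectTagging rather than this helper) before returning the request.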

Upvotes: 0

stevel

Reputation: 13480

Not in the s3a codebase as of March 9, 2021; no idea about EMR's s3 connector.

Update, February 2023: the Hadoop 3.3.5+ s3a connector

  • lets you set headers using the createFile() builder API (though not through the RDD API unless someone wires it up)
  • has the option fs.s3a.object.content.encoding to set the content encoding; set it through spark.hadoop.fs.s3a.object.content.encoding to have it picked up on all files (HADOOP-17851)
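The second option can be passed at submit time, since Spark forwards spark.hadoop.* properties into the Hadoop configuration. A hypothetical invocation (class name, jar, and bucket are placeholders):

```shell
# Every object the job writes via s3a gets Content-Encoding: gzip
spark-submit \
  --conf spark.hadoop.fs.s3a.object.content.encoding=gzip \
  --class com.example.MyJob \
  my-job.jar s3a://my-bucket/output/
```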

Before anyone asks "will this be backported?" the answer is: Not by the hadoop developers. Everyone is free to make their private forks of old releases and cherrypick whatever they want, but don't expect others to do it for you.

Upvotes: 3
