Reputation: 29687
Is there a way to set an S3 object's metadata (I want to set the tag) when writing an RDD to S3 from Spark? The examples that I find (such as Amazon's, and "Spark set S3 object metadata while writing to EMRFS") are for DataFrames, not RDDs.
Upvotes: 2
Views: 2142
Reputation: 9
This solution looks like vandalism and can hardly be used in production, but it works for Datasets and theoretically should work for RDDs. The build below unpacks the EMRFS assembly from the cluster, excludes its S3ObjectRequestFactory class, and relocates the EMRFS packages with the Shadow plugin so you can supply your own copy of that class:
plugins {
    id 'java'
    id 'com.github.johnrengelman.shadow' version '8.1.1'
}

apply plugin: 'com.github.johnrengelman.shadow'

jar.enabled = false
version = '0.0.DEBUG'
sourceCompatibility = 1.8

repositories {
    maven {
        url 'https://s3.us-east-1.amazonaws.com/us-east-1-emr-artifacts/emr-6.6.0/repos/maven/'
    }
    mavenCentral()
}

ext['aws_version'] = '1.12.170'
// For EMR Spark and Hadoop versions, please refer to https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-650-release.html
// https://s3.console.aws.amazon.com/s3/buckets/us-east-1-emr-artifacts?prefix=emr-6.6.0/repos/maven/org/apache/hadoop/&region=us-east-1
ext['hadoop_version'] = '3.2.1-amzn-6'
ext['spark_version'] = '3.2.0'

task unzipEmrHadoopAssembly(type: Copy) {
    from zipTree('libs/emrfs-hadoop-assembly-2.50.0.jar')
    into("$buildDir/libs/emrfs-hadoop-assembly")
    include "**/*"
    exclude "**/S3ObjectRequestFactory.class"
}

dependencies {
    shadow("org.apache.spark:spark-core_2.12:$spark_version")
    shadow("org.apache.spark:spark-sql_2.12:$spark_version")
    shadow("org.apache.hadoop:hadoop-aws:$hadoop_version")
    shadow("org.apache.hadoop:hadoop-common:$hadoop_version")
    shadow("org.apache.hadoop:hadoop-client:$hadoop_version")
    // lib from the cluster, from /lib/spark/jars/emrfs-hadoop-assembly-2.50.0.jar
    implementation files("$buildDir/libs/emrfs-hadoop-assembly") {
        builtBy 'unzipEmrHadoopAssembly'
    }
    testImplementation('junit:junit:4.12')
}

assemble.dependsOn shadowJar

shadowJar {
    manifest {
        attributes(
            'Main-Class': 'com.mycompany.Runner'
        )
    }
    relocate("com.amazon.ws.emr.hadoop.fs", "com.amazon.mycompany.emr.hadoop.fs")
    archiveFileName = "${project.name}-${project.version}.jar"
}
Then copy the class

com.amazon.ws.emr.hadoop.fs.s3.S3ObjectRequestFactory

(the one excluded from the assembly above) into your project and change the methods newPutObjectRequest, newCopyObjectRequest and newMultipartUploadRequest to set tags on your S3 objects. Finally, set

sparkConf.set("spark.hadoop.fs.s3.impl", "com.amazon.mycompany.emr.hadoop.fs.EmrFileSystem")

when you need to create S3 files with tags.

Upvotes: 0
Reputation: 13480
Not in the s3a codebase as of March 9, 2021. No idea about EMR's S3 connector.

Update, Feb 2023: the Hadoop 3.3.5+ s3a connector supports this through the createFile() builder API (not through the RDD API unless someone wires it up). There is also fs.s3a.object.content.encoding to set the encoding ... set it through spark.hadoop.fs.s3a.object.content.encoding to be picked up on all files (HADOOP-17851). Before anyone asks "will this be backported?", the answer is: not by the Hadoop developers. Everyone is free to make their private forks of old releases and cherrypick whatever they want, but don't expect others to do it for you.
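As a config fragment, the per-job form of that setting might look like the following (property name from HADOOP-17851 as quoted above; the encoding value, main class and jar name are placeholders):

spark-submit \
    --conf spark.hadoop.fs.s3a.object.content.encoding=gzip \
    --class com.mycompany.Runner \
    my-job.jar

Because it is passed with the spark.hadoop. prefix, Spark forwards it into the Hadoop configuration, so every file the job writes through s3a picks it up without any per-write code.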
Upvotes: 3