mhdjunaid

Reputation: 21

How can we convert a HadoopRDD result into Parquet format?

I'm attempting to read a DynamoDB table using Apache Spark.

Here is my implementation, run from the Spark shell:

spark-shell --jars /usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar

import org.apache.hadoop.io.Text
import org.apache.hadoop.dynamodb.DynamoDBItemWritable

/* Importing DynamoDBInputFormat and DynamoDBOutputFormat */
import org.apache.hadoop.dynamodb.read.DynamoDBInputFormat
import org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat
import org.apache.hadoop.mapred.JobConf
import org.apache.hadoop.io.LongWritable

var jobConf = new JobConf(sc.hadoopConfiguration)
jobConf.set("dynamodb.servicename", "dynamodb")
jobConf.set("dynamodb.input.tableName", "myDynamoDBTable")

// Pointing to the DynamoDB table
jobConf.set("dynamodb.endpoint", "dynamodb.us-east-1.amazonaws.com")
jobConf.set("dynamodb.regionid", "us-east-1")
jobConf.set("dynamodb.throughput.read", "1")
jobConf.set("dynamodb.throughput.read.percent", "1")
jobConf.set("dynamodb.version", "2011-12-05")
jobConf.set("mapred.output.format.class", "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
jobConf.set("mapred.input.format.class", "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")

var orders = sc.hadoopRDD(jobConf, classOf[DynamoDBInputFormat], classOf[Text], classOf[DynamoDBItemWritable])

The result is returned in the orders variable.

How can I convert this result into Parquet format?

Update: I found this piece of code for accessing and converting DynamoDB data: https://github.com/onzocom/spark-dynamodb/blob/master/src/main/scala/com/onzo/spark/dynamodb/DynamoDbRelation.scala

Upvotes: 1

Views: 376

Answers (1)

Krishna Kalyan

Reputation: 1702

DataFrames can be saved as Parquet files, but RDDs cannot, because Parquet requires a schema. An RDD carries no schema, while a DataFrame always has one, so you need to convert the RDD into a DataFrame before writing Parquet.
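A minimal sketch of that conversion, assuming a Spark 2 shell where the spark session is in scope (on Spark 1.x use sqlContext.implicits._ instead); the attribute names customerId and total are hypothetical stand-ins for your table's attributes:

import spark.implicits._

// Each DynamoDBItemWritable wraps a java.util.Map[String, AttributeValue].
// Flatten each item into plain Scala values so Spark can derive a schema.
// "customerId" (a string attribute) and "total" (a number attribute) are
// hypothetical names: substitute the attributes in your own table.
val ordersDF = orders.map { case (_, item) =>
  val attrs = item.getItem
  (attrs.get("customerId").getS, attrs.get("total").getN)
}.toDF("customerId", "total")

// A DataFrame carries a schema, so it can be written out as Parquet.
ordersDF.write.parquet("s3://my-bucket/orders-parquet/")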

Upvotes: 1
