Reputation: 21
I'm attempting to read a DynamodDB
table using Apache Spark.
Following is my implementation:
So in the Spark Shell
spark-shell --jars /usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar
import org.apache.hadoop.io.Text;
import org.apache.hadoop.dynamodb.DynamoDBItemWritable
/* Importing DynamoDBInputFormat and DynamoDBOutputFormat */
import org.apache.hadoop.dynamodb.read.DynamoDBInputFormat
import org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat
import org.apache.hadoop.mapred.JobConf
import org.apache.hadoop.io.LongWritable
var jobConf = new JobConf(sc.hadoopConfiguration)
jobConf.set("dynamodb.servicename", "dynamodb")
jobConf.set("dynamodb.input.tableName", "myDynamoDBTable")
// Pointing to DynamoDB table
jobConf.set("dynamodb.endpoint", "dynamodb.us-east-1.amazonaws.com")
jobConf.set("dynamodb.regionid", "us-east-1") jobConf.set("dynamodb.throughput.read", "1")
jobConf.set("dynamodb.throughput.read.percent", "1")
jobConf.set("dynamodb.version", "2011-12-05")
jobConf.set("mapred.output.format.class", "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
jobConf.set("mapred.input.format.class", "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")
var orders = sc.hadoopRDD(jobConf, classOf[DynamoDBInputFormat], classOf[Text], classOf[DynamoDBItemWritable])
We get the result in "orders" variable.
How can I convert this result into Parquet File or Format?
Update: I found this piece of code to access and convert dynamodb data https://github.com/onzocom/spark-dynamodb/blob/master/src/main/scala/com/onzo/spark/dynamodb/DynamoDbRelation.scala?
Upvotes: 1
Views: 376
Reputation: 1702
Dataframes can be saved as Parquet Files, but RDD's cannot. This is because Parquet Files requires a schema. RDD's aren't required to have a schema, but dataframes must.
Upvotes: 1