Reputation: 48003
What is the method to create a DataFrame from an RDD that has been saved as an object file? I want to load the RDD, but I don't have a Java object, only a StructType that I want to use as the schema for the DataFrame.
I tried retrieving it as a Row:
val myrdd = sc.objectFile[org.apache.spark.sql.Row]("/home/bipin/"+name)
But I get
java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to org.apache.spark.sql.Row
Is there a way to do this?
Edit
From what I understand, I have to read the RDD as an array of objects and convert it to a Row. If anyone can give a method for this, it would be acceptable.
Upvotes: 0
Views: 2103
Reputation: 930
If you have an Array[Object], you only have to use the Row apply method for an array of Any. In code it will be something like this:
val myrdd = sc.objectFile[Array[Object]]("/home/bipin/"+name).map(x => Row(x))
EDIT
You are right @user568109, this will create a DataFrame with only one field that will be an Array. To parse the whole array you have to do this:
val myrdd = sc.objectFile[Array[Object]]("/home/bipin/"+name).map(x => Row.fromSeq(x.toSeq))
As @user568109 said, there are other ways to do this:
val myrdd = sc.objectFile[Array[Object]]("/home/bipin/"+name).map(x => Row(x:_*))
It doesn't matter which one you use, because both are wrappers for the same code:
/**
* This method can be used to construct a [[Row]] with the given values.
*/
def apply(values: Any*): Row = new GenericRow(values.toArray)
/**
* This method can be used to construct a [[Row]] from a [[Seq]] of values.
*/
def fromSeq(values: Seq[Any]): Row = new GenericRow(values.toArray)
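Putting this together with the StructType from the question, a minimal end-to-end sketch could look like the following (the field names in the schema and the sqlContext variable are assumptions for illustration; substitute your own StructType):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// hypothetical schema; replace with the StructType you already have
val schema = StructType(Seq(
  StructField("item", StringType, nullable = true),
  StructField("price", StringType, nullable = true)))

// load the object file as Array[Object] and turn each array into a Row
val rowRDD = sc.objectFile[Array[Object]]("/home/bipin/" + name)
  .map(arr => Row.fromSeq(arr.toSeq))

// create the DataFrame with the explicit schema
val myDF = sqlContext.createDataFrame(rowRDD, schema)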
Upvotes: 1
Reputation: 411
Let me add some explanation.
Suppose you have a MySQL table grocery with the columns (grocery_id, item, category, price) and the contents below:
+------------+---------+----------+-------+
| grocery_id | item | category | price |
+------------+---------+----------+-------+
| 1 | tomato | veg | 2.40 |
| 2 | raddish | veg | 4.30 |
| 3 | banana | fruit | 1.20 |
| 4 | carrot | veg | 2.50 |
| 5 | apple | fruit | 8.10 |
+------------+---------+----------+-------+
5 rows in set (0.00 sec)
Now you want to read it within Spark; your code will be something like below:
import java.sql.DriverManager
import org.apache.spark.rdd.JdbcRDD

val groceryRDD = new JdbcRDD(sc, () => DriverManager.getConnection(url, uname, passwd),
  "select item,price from grocery limit ?,?", 1, 10, 2,
  r => r.getString("item") + "|" + r.getString("price"))
Note: in the above statement I converted the ResultSet into a String: r => r.getString("item") + "|" + r.getString("price")
So my JdbcRDD will be:
groceryRDD: org.apache.spark.rdd.JdbcRDD[String] = JdbcRDD[29] at JdbcRDD at <console>:21
Now you save it:
groceryRDD.saveAsObjectFile("/user/cloudera/jdbcobject")
Answer to your question
While reading the object file, you need to write it as below:
val newJdbObjectFile = sc.objectFile[String]("/user/cloudera/jdbcobject")
Simply substitute the type parameter of the RDD you saved. In my case, groceryRDD has String as its type parameter, hence I have used the same.
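From there, turning those pipe-delimited strings back into a DataFrame might look like this (a sketch; the sqlContext, the field names, and the escaped split are my additions, not part of the original answer):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// hypothetical two-field schema matching the "item|price" strings saved above
val schema = StructType(Seq(
  StructField("item", StringType, nullable = true),
  StructField("price", StringType, nullable = true)))

// "|" is a regex metacharacter, so it must be escaped in split
val rows = newJdbObjectFile.map(line => Row.fromSeq(line.split("\\|").toSeq))
val df = sqlContext.createDataFrame(rows, schema)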
UPDATE:
In your case, as mentioned by jlopezmat, you need to use Array[Object].
Here each row of the RDD will be an Object, but since you converted it using resultSetToObjectArray, each row with its contents will be saved as an Array.
That is, in my case, if I save the above RDD as below:
val groceryRDD = new JdbcRDD(sc, ()=> DriverManager.getConnection(url,uname,passwd), "select item,price from grocery limit ?,?",1,10,2,r => JdbcRDD.resultSetToObjectArray(r))
and when I read it back and collect the data:
val newJdbcObjectArrayRDD = sc.objectFile[Array[Object]]("...")
val result = newJdbcObjectArrayRDD.collect
the result will be of type Array[Array[Object]]:
result: Array[Array[Object]] = Array(Array(raddish, 4.3), Array(banana, 1.2), Array(carrot, 2.5), Array(apple, 8.1))
You can parse the above based on your column definitions; a sketch follows below.
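For instance, a sketch of that parsing, assuming the two columns are item (String) and price (Double); the names and types here are illustrative, not from the original answer:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

// map each Array[Object] back to a typed Row; the casts assume (item, price) columns
val typedRows = newJdbcObjectArrayRDD.map(arr =>
  Row(arr(0).toString, arr(1).toString.toDouble))

val schema = StructType(Seq(
  StructField("item", StringType, nullable = true),
  StructField("price", DoubleType, nullable = true)))

val groceryDF = sqlContext.createDataFrame(typedRows, schema)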
Please let me know if it answered your question.
Upvotes: 0