user568109

Reputation: 48003

Create dataframe from rdd objectfile

How can I create a DataFrame from an RDD that was saved as an object file? I want to load the RDD, but I don't have a Java class for its records, only a StructType that I want to use as the schema for the DataFrame.

I tried reading it back as Row:

    val myrdd = sc.objectFile[org.apache.spark.sql.Row]("/home/bipin/"+name)

But I get

java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to org.apache.spark.sql.Row

Is there a way to do this?

Edit

From what I understand, I have to read the RDD as arrays of objects and convert each array to a Row. If anyone can show how to do this, I would accept that as an answer.

Upvotes: 0

Views: 2103

Answers (2)

jlopezmat

Reputation: 930

If you have an Array of Object, you only need the Row apply method, which takes a variable number of Any values. The code would look something like this:

    val myrdd = sc.objectFile[Array[Object]]("/home/bipin/"+name).map(x => Row(x))

EDIT

You are right, @user568109: this creates a DataFrame with only one field, which holds the whole array. To spread the array's elements across the fields of the row, do this:

    val myrdd = sc.objectFile[Array[Object]]("/home/bipin/"+name).map(x => Row.fromSeq(x.toSeq))
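The difference between the two calls can be checked without any RDD at all. The following is only a minimal sketch, assuming spark-sql is on the classpath (for example in spark-shell); the sample values are made up for illustration:

```scala
import org.apache.spark.sql.Row

// Hypothetical record, shaped like one element of the object file.
val fields: Array[Object] = Array("tomato", "veg", java.lang.Double.valueOf(2.40))

// Row(fields) passes the array as a single vararg, so the row has one field: the array itself.
val wrapped = Row(fields)

// Row.fromSeq spreads the elements, giving one field per array element.
val spread = Row.fromSeq(fields.toSeq)

println(wrapped.length) // number of fields in the wrapped row
println(spread.length)  // number of fields in the spread row
```

Only the second form lines up with a multi-column schema, since each array element becomes its own field.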

As @user568109 said, there are other ways to do this:

    val myrdd = sc.objectFile[Array[Object]]("/home/bipin/"+name).map(x => Row(x:_*))

It does not matter which one you use, because both are wrappers around the same code:

  /**
   * This method can be used to construct a [[Row]] with the given values.
   */
   def apply(values: Any*): Row = new GenericRow(values.toArray)

  /**
   * This method can be used to construct a [[Row]] from a [[Seq]] of values.
   */
   def fromSeq(values: Seq[Any]): Row = new GenericRow(values.toArray)
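To get from the RDD of Row to the DataFrame the question asks about, the rows can be paired with the StructType. This is a rough sketch rather than a definitive recipe. It assumes Spark 2.x or later with a local SparkSession; on Spark 1.x, `sqlContext.createDataFrame(rowRdd, schema)` should play the same role. The schema, the sample data, and the temporary path are all hypothetical stand-ins for the question's real file:

```scala
import java.nio.file.Files
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

val spark = SparkSession.builder().master("local[*]").appName("objectfile-to-df").getOrCreate()
val sc = spark.sparkContext

// Hypothetical schema; in the question this is the StructType you already have.
val schema = StructType(Seq(
  StructField("item", StringType, nullable = true),
  StructField("price", DoubleType, nullable = true)
))

// Write a small object file so the sketch is self-contained (stand-in for the real file).
val path = Files.createTempDirectory("objfile").resolve("rows").toString
sc.parallelize(Seq[Array[Object]](
  Array("tomato", java.lang.Double.valueOf(2.40)),
  Array("banana", java.lang.Double.valueOf(1.20))
)).saveAsObjectFile(path)

// Read the arrays back, turn each one into a Row, and attach the schema.
val rowRdd = sc.objectFile[Array[Object]](path).map(x => Row.fromSeq(x.toSeq))
val df = spark.createDataFrame(rowRdd, schema)

df.printSchema()
```

The runtime type of every array element has to match the corresponding field type in the schema (here String and java.lang.Double); a mismatch tends to surface only when an action runs on the DataFrame, not when it is created.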

Upvotes: 1

Akash

Reputation: 411

Let me add some explanation.

Suppose you have a MySQL table grocery with four columns (grocery_id, item, category, price) and the following contents:

+------------+---------+----------+-------+
| grocery_id | item    | category | price |
+------------+---------+----------+-------+
|          1 | tomato  | veg      |  2.40 |
|          2 | raddish | veg      |  4.30 |
|          3 | banana  | fruit    |  1.20 |
|          4 | carrot  | veg      |  2.50 |
|          5 | apple   | fruit    |  8.10 |
+------------+---------+----------+-------+
5 rows in set (0.00 sec)

Now, to read it within Spark, your code will look something like this:

val groceryRDD = new JdbcRDD(sc, ()=> DriverManager.getConnection(url,uname,passwd), "select item,price from grocery limit ?,?",1,10,2,r => r.getString("item")+"|"+r.getString("price"))

Note: in the above statement I converted each ResultSet row into a String with r => r.getString("item")+"|"+r.getString("price")

So my JdbcRDD will be:

groceryRDD: org.apache.spark.rdd.JdbcRDD[String] = JdbcRDD[29] at JdbcRDD at <console>:21

Now you save it:

groceryRDD.saveAsObjectFile("/user/cloudera/jdbcobject")

Answer to your question

When reading the object file back, you need to write the following:

val newJdbObjectFile = sc.objectFile[String]("/user/cloudera/jdbcobject")

Simply use the same type parameter as the RDD you saved.

In my case, groceryRDD has the type parameter String, so I used the same.

UPDATE:

In your case, as mentioned by jlopezmat, you need to use Array[Object].

Here each row of the RDD would be an Object, but since you converted it using resultSetToObjectArray, each row is saved as an Array holding its column values.

That is, in my case, if I build and save the above RDD as below,

val groceryRDD = new JdbcRDD(sc, ()=> DriverManager.getConnection(url,uname,passwd), "select item,price from grocery limit ?,?",1,10,2,r => JdbcRDD.resultSetToObjectArray(r))

then read it back and collect the data,

val newJdbcObjectArrayRDD = sc.objectFile[Array[Object]]("...")
val result = newJdbcObjectArrayRDD.collect

the result will be of type Array[Array[Object]]:

result: Array[Array[Object]] = Array(Array(raddish, 4.3), Array(banana, 1.2), Array(carrot, 2.5), Array(apple, 8.1))

You can parse the above based on your column definitions.
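One way to do that parsing is sketched below in plain Scala, with no Spark needed. The `GroceryPrice` case class and the sample rows are hypothetical, and the sketch assumes two non-null columns in the order (item, price). The price is matched as a `java.lang.Number` because a JDBC driver may hand back a `java.math.BigDecimal` or a `java.lang.Double` depending on the column type:

```scala
// Hypothetical target type for the two selected columns.
final case class GroceryPrice(item: String, price: Double)

// Stand-in for the collected result shown above.
val result: Array[Array[Object]] = Array(
  Array[Object]("raddish", java.lang.Double.valueOf(4.3)),
  Array[Object]("banana", java.lang.Double.valueOf(1.2))
)

val parsed: Array[GroceryPrice] = result.map { row =>
  val item = row(0).toString
  // Accept any numeric boxed type; fall back to parsing the string form.
  val price = row(1) match {
    case n: java.lang.Number => n.doubleValue()
    case other               => other.toString.toDouble
  }
  GroceryPrice(item, price)
}

parsed.foreach(println)
```

The same function body can be passed to `map` on the RDD itself instead of on the collected array, which avoids pulling all rows to the driver.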

Please let me know if this answers your question.

Upvotes: 0
