sds
sds

Reputation: 60004

Extract information from a `org.apache.spark.sql.Row`

I have Array[org.apache.spark.sql.Row] returned by sqc.sql(sqlcmd).collect():

Array([10479,6,10], [8975,149,640], ...)

I can get the individual values:

scala> pixels(0)(0)
res34: Any = 10479

but they are Any, not Int.

How do I extract them as Int?

The most obvious solution did not work:

scala> pixels(0).getInt(0)
java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Int

PS. I can do pixels(0)(0).toString.toInt or pixels(0).getString(0).toInt, but they feel wrong...

Upvotes: 16

Views: 40150

Answers (3)

Pankaj Narang
Pankaj Narang

Reputation: 61

the answer is relevant. you dont need to use collect instead you need to call the methods getInt getString and getAs as well in case the datatype is complex

val popularHashTags = sqlContext.sql("SELECT hashtags, usersMentioned, Url FROM tweets")
var hashTagsList =  popularHashTags.flatMap ( x => x.getAs[Seq[String]](0)) 

Upvotes: 0

Justin Pihony
Justin Pihony

Reputation: 67065

Using getInt should work. Here is a contrived example as a proof of concept

import org.apache.spark.sql._
sc.parallelize(Array(1,2,3)).map(Row(_)).collect()(0).getInt(0)

This return 1

However,

sc.parallelize(Array("1","2","3")).map(Row(_)).collect()(0).getInt(0)

fails. So, it looks like it is coming in as a string and you will have to convert to an int manually.

sc.parallelize(Array("1","2","3")).map(Row(_)).collect()(0).getString(0).toInt

The documentation states that getInt:

Returns the value of column i as an int. This function will throw an exception if the value is at i is not an integer, or if it is null.

So, it will not try to cast for you it seems

Upvotes: 14

tgpfeiffer
tgpfeiffer

Reputation: 1838

The Row class (also see https://spark.apache.org/docs/1.1.0/api/scala/index.html#org.apache.spark.sql.package) has methods getInt(i: Int), getDouble(i: Int) etc.

Also note that a SchemaRDD is an RDD[Row] plus a schema that tells you which column has which data type. If you do .collect() you will only get an Array[Row] which does not have that information. So unless you know for sure what your data looks like, get the schema from the SchemaRDD, then collect the rows and then access each field using the correct type information.

Upvotes: 2

Related Questions