Reputation: 1944
I have an array called arraylist that looks like this:
arraylist: Array[(String, Any)] = Array((id,772914), (x4,2), (x5,24), (x6,1), (x7,77491.25), (x8,17911.77778), (x9,225711), (x10,17), (x12,6), (x14,5), (x16,5), (x18,5.0), (x19,8.0), (x20,7959.0), (x21,676.0), (x22,228.5068871), (x23,195.0), (x24,109.6015511), (x25,965.0), (x26,1017.79043), (x27,2.0), (Target,1), (x29,13), (x30,735255.5), (x31,332998.432), (x32,38168.75), (x33,107957.5278), (x34,13), (x35,13), (x36,13), (x37,13), (x38,13), (x39,13), (x40,13), (x41,7), (x42,13), (x43,13), (x44,13), (x45,13), (x46,13), (x47,13), (x48,13), (x49,14.0), (x50,2.588435821), (x51,617127.5), (x52,414663.9738), (x53,39900.0), (x54,16743.15781), (x55,105000.0), (x56,52842.29076), (x57,25750.46154), (x58,8532.045819), (x64,13), (x66,13), (x67,13), (x68,13), (x69,13), (x70,13), (x71,13), (x73,13), (...
I want to convert it to a DataFrame with two columns, "ID" and value. For this, the code I am using is:
val df = sc.parallelize(arraylist).toDF("Names","Values")
However, I am getting an error:
java.lang.UnsupportedOperationException: Schema for type Any is not supported
How can I overcome this problem?
Upvotes: 7
Views: 15996
Reputation: 13001
The problem (as the error states) is that Any is not a legal type for a DataFrame column. In general, legal types are primitive types (byte, int, boolean, string, double, etc.), structs of legal types, arrays of legal types, and maps of legal types.
In your case, it seems you used both integers and doubles as the second element of the tuples. If you use only doubles instead, it should work properly.
You can do this in two ways:
1. Make sure the original array contains only doubles (e.g. by adding .0 to the end of each integer when you create it, or by casting).
2. Enforce the schema:
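import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
// StructType is immutable: add returns a new StructType, so the calls must be chained
val schema = new StructType()
  .add(StructField("names", StringType))
  .add(StructField("values", DoubleType))
// Convert every value to Double so that each Row matches the schema
val rdd = sc.parallelize(arraylist).map(x => Row(x._1, x._2.asInstanceOf[Number].doubleValue()))
val df = spark.createDataFrame(rdd, schema)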
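The first option (casting before building the DataFrame) could be sketched in plain Scala as follows. The names CastToDouble and toDoublePairs are hypothetical, and silently dropping entries that are not numeric (nulls, strings) is an assumption you may want to change:

```scala
object CastToDouble {
  // Keep only entries whose value is numeric and widen each one to Double.
  // Non-numeric values (null, String, ...) are dropped; this is an assumption.
  def toDoublePairs(pairs: Array[(String, Any)]): Array[(String, Double)] =
    pairs.collect { case (name, n: Number) => (name, n.doubleValue()) }

  def main(args: Array[String]): Unit = {
    val arraylist: Array[(String, Any)] =
      Array(("id", 772914), ("x4", 2), ("x7", 77491.25))
    val cleaned = toDoublePairs(arraylist)
    cleaned.foreach(println)
    // In a Spark shell with the implicits in scope, the typed array should
    // then work with toDF directly:
    // val df = sc.parallelize(cleaned).toDF("Names", "Values")
  }
}
```

Because the result is an Array[(String, Double)], Spark no longer has to derive a schema for Any.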
Upvotes: 2
Reputation: 16076
The message tells you everything :) Any is not supported as a DataFrame column type. The Any type can be caused by nulls in the second element of a tuple.
Change the type of arraylist to a concrete one, e.g. Array[(String, Int)] or Array[(String, Double)] (if you can do it manually; if it is inferred by Scala, then check for nulls and invalid values in the second element), or create the schema manually:
import org.apache.spark.sql.types._
import org.apache.spark.sql._
val arraylist: Array[(String, Any)] = Array(("id", 772914), ("x4", 2.0), ("x5", 24.0))
val schema = StructType(
  StructField("Names", StringType, false) ::
  StructField("Values", DoubleType, false) :: Nil)
val rdd = sc.parallelize(arraylist).map(x => Row(x._1, x._2.asInstanceOf[Number].doubleValue()))
val df = sqlContext.createDataFrame(rdd, schema)
df.show()
Note: createDataFrame with an explicit schema requires an RDD[Row], so I'm converting the RDD of tuples to an RDD of Rows.
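To check for nulls and invalid values in the second element, as suggested above, something like the following plain-Scala sketch might help. FindInvalid and invalidEntries are hypothetical names, and treating "anything that is not a Number" as invalid is an assumption:

```scala
object FindInvalid {
  // Return the entries whose value is not numeric. null is included,
  // because isInstanceOf is false for null.
  def invalidEntries(pairs: Array[(String, Any)]): Array[(String, Any)] =
    pairs.filter { case (_, v) => !v.isInstanceOf[Number] }

  def main(args: Array[String]): Unit = {
    val arraylist: Array[(String, Any)] =
      Array(("id", 772914), ("x4", null), ("x5", "n/a"), ("x7", 77491.25))
    // Print the names of the entries that would break the asInstanceOf[Number] cast
    invalidEntries(arraylist).foreach { case (name, v) => println(s"$name -> $v") }
  }
}
```

Running this before the asInstanceOf[Number] cast shows which entries would fail with a NullPointerException or ClassCastException.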
Upvotes: 11