Reputation: 1409
How can I create a Dataset using StructType?
We can create a Dataset as follows:
case class Person(name: String, age: Int)

val personDS = Seq(Person("Max", 33), Person("Adam", 32), Person("Muller", 62)).toDS()
personDS.show()
Is there a way to create a Dataset without using a case class? I'd like to create a DataFrame using a StructType rather than a case class.
Upvotes: 4
Views: 11450
Reputation: 19348
Here's how you can create a Dataset with a StructType:
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import spark.implicits._ // tuple encoder for .as[(String, Int)] (pre-imported in spark-shell)

// Describe the schema explicitly instead of deriving it from a case class
val schema = StructType(Seq(
  StructField("name", StringType, true),
  StructField("age", IntegerType, true)
))

val data = Seq(
  Row("Max", 33),
  Row("Adam", 32),
  Row("Muller", 62)
)

val personDF = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  schema
)

// Convert the untyped DataFrame to a typed Dataset of tuples
val yourDS = personDF.as[(String, Int)]
yourDS.show()
+------+---+
| name|age|
+------+---+
| Max| 33|
| Adam| 32|
|Muller| 62|
+------+---+
yourDS is an org.apache.spark.sql.Dataset[(String, Int)]. The personDS in your question is of type org.apache.spark.sql.Dataset[Person], so this doesn't quite give the same result.
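If what you actually want is a Dataset[Person], you still need the case class in scope, since it supplies both the type and its encoder. A minimal sketch, assuming the Person case class from the question is defined:

// Requires: case class Person(name: String, age: Int) and the implicit encoders
import spark.implicits._

val typedDS = personDF.as[Person]
// typedDS: org.apache.spark.sql.Dataset[Person] = [name: string, age: int]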
See this post for more info on how to create Datasets.
Upvotes: 0
Reputation: 74779
That's an interesting question in the sense that I don't see a reason why one would want it.
How can I create a Dataset using "StructType"?
I'd then ask a very similar question: why would you like to "trade" a case class for a StructType? What would that give you that a case class could not?
The reason you use a case class is that it can offer you two things at once:

1. Describe your schema quickly, nicely and type-safely
2. Working with your data becomes type-safe
Regarding 1., as a Scala developer you will define business objects that describe your data. You will have to do it anyway (unless you like tuples and _1 and such).

Regarding type-safety (in both 1. and 2.), it is about transforming your data in a way that leverages the Scala compiler, which can find places where you expect a String but have an Int. With StructType the check happens only at runtime (not compile time).
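To make that concrete, here is an illustrative sketch; the salary field is hypothetical and exists in neither schema, and personDF stands for a StructType-backed DataFrame like the one in the other answer:

// With the case class, a bad field reference is caught at compile time:
// personDS.map(_.salary)   // does not compile: value salary is not a member of Person

// With a StructType-defined DataFrame, the same mistake compiles,
// and fails only at runtime with an AnalysisException:
// personDF.select("salary")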
With all that said, the answer to your question is "Yes". You can create a Dataset without a case class:
scala> val personDS = Seq(("Max", 33), ("Adam", 32), ("Muller", 62)).toDS
personDS: org.apache.spark.sql.Dataset[(String, Int)] = [_1: string, _2: int]
scala> personDS.show
+------+---+
| _1| _2|
+------+---+
| Max| 33|
| Adam| 32|
|Muller| 62|
+------+---+
You may be wondering why you don't see the column names. That's exactly what a case class would give you: not only the types, but also the names of the columns.
There is, however, one trick you can use to avoid dealing with case classes if you don't like them.
// Rename the tuple columns _1/_2, then re-type the result
val withNames = personDS.toDF("name", "age").as[(String, Int)]
scala> withNames.show
+------+---+
| name|age|
+------+---+
| Max| 33|
| Adam| 32|
|Muller| 62|
+------+---+
Upvotes: 3
Reputation: 16096
If you know how to create a DataFrame, you already know how to create a Dataset :)

DataFrame = Dataset[Row].

What does that mean? Try:
import org.apache.spark.sql._

val df: DataFrame = spark.createDataFrame(...) // with StructType
val ds: Dataset[Row] = df // no error: DataFrame is just a type alias for Dataset[Row]
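For completeness, a minimal self-contained sketch of that idea; the schema and rows are illustrative, borrowed from the shape of the question:

import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local[*]").getOrCreate()

val schema = StructType(Seq(
  StructField("name", StringType, true),
  StructField("age", IntegerType, true)
))
val rows = java.util.Arrays.asList(Row("Max", 33), Row("Adam", 32))

// createDataFrame with an explicit StructType...
val df: DataFrame = spark.createDataFrame(rows, schema)

// ...already gives you a Dataset: DataFrame is just Dataset[Row]
val ds: Dataset[Row] = df
ds.show()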
Upvotes: 7