ZygD

Reputation: 24356

Struct data type when creating dataframe with createDataFrame in Scala

In PySpark, createDataFrame infers a struct column from a nested tuple, such as ("b", "c") and ("e", "f") in the following example:

df = spark.createDataFrame([
    ["a", ("b", "c")],
    ["d", ("e", "f")]
])

df.printSchema()
# root
#  |-- _1: string (nullable = true)
#  |-- _2: struct (nullable = true)
#  |    |-- _1: string (nullable = true)
#  |    |-- _2: string (nullable = true)
df.show()
# +---+------+
# | _1|    _2|
# +---+------+
# |  a|{b, c}|
# |  d|{e, f}|
# +---+------+

Is there a similar way in Scala to create a struct column directly inside createDataFrame, without using org.apache.spark.sql.functions?

Upvotes: 0

Views: 794

Answers (1)

Derek Plautz

Reputation: 108

For your specific example, you can use nested tuples and call the createDataFrame overload that accepts a Seq of tuples (or of any other Product type).

val df = spark.createDataFrame(Seq(
  ("a", ("b", "c")),
  ("d", ("e", "f"))
))

df.printSchema()
/*
root
 |-- _1: string (nullable = true)
 |-- _2: struct (nullable = true)
 |    |-- _1: string (nullable = true)
 |    |-- _2: string (nullable = true)
*/

df.show()
/*
+---+------+
| _1|    _2|
+---+------+
|  a|[b, c]|
|  d|[e, f]|
+---+------+
*/

Instead of ("b", "c") one can also use "b" -> "c" to create a tuple of length 2.
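The arrow form is plain Scala and does not depend on Spark. The sketch below can be pasted into a Scala REPL or spark-shell; the value names are only illustrative. It also shows a caveat: chaining arrows nests to the left and does not produce a flat 3-tuple, so in a DataFrame it would become a struct inside a struct.

```scala
// `->` builds an ordinary Tuple2, identical to the parenthesized form.
val viaArrow: (String, String) = "b" -> "c"
val viaParens: (String, String) = ("b", "c")
assert(viaArrow == viaParens)

// Chaining `->` is left-associative: the result is ((b, c), d),
// a pair whose first element is itself a pair, not a Tuple3.
val chained: ((String, String), String) = "b" -> "c" -> "d"
assert(chained == (("b", "c"), "d"))
```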

Preferred method

Tuples become difficult to manage when there are many fields, especially nested ones. You will likely want to model your data with case classes instead, which also lets you specify the struct field names and types.

case class Person(name: String, age: Int)
case class Car(manufacturer: String, model: String, mileage: Double, owner: Person)

val df = spark.createDataFrame(Seq(
  Car("Toyota", "Camry", 81400.8, Person("John", 37)),
  Car("Honda", "Accord", 152090.2, Person("Jane", 25))
))

df.printSchema()
/*
root
 |-- manufacturer: string (nullable = true)
 |-- model: string (nullable = true)
 |-- mileage: double (nullable = false)
 |-- owner: struct (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- age: integer (nullable = false)
*/

df.show()
/*
+------------+------+--------+----------+
|manufacturer| model| mileage|     owner|
+------------+------+--------+----------+
|      Toyota| Camry| 81400.8|[John, 37]|
|       Honda|Accord|152090.2|[Jane, 25]|
+------------+------+--------+----------+
*/
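With named struct fields, a nested value can be selected by a dotted path, still without org.apache.spark.sql.functions. This is a minimal sketch, assuming a spark-shell or similar script session; the local master setting, the app name, and the `owners` value are illustrative choices and not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int)
case class Car(manufacturer: String, model: String, mileage: Double, owner: Person)

// In spark-shell a session already exists and getOrCreate() simply returns it.
// The master and app name here are assumptions for running locally.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("struct-demo")
  .getOrCreate()

val df = spark.createDataFrame(Seq(
  Car("Toyota", "Camry", 81400.8, Person("John", 37)),
  Car("Honda", "Accord", 152090.2, Person("Jane", 25))
))

// "owner.name" reaches into the struct column; the resulting
// column is named after the leaf field ("name").
val owners = df.select("manufacturer", "owner.name")
owners.show()
```

In a compiled application (as opposed to a shell session), the case classes need to be defined at the top level, outside the method that calls createDataFrame, so that Spark can derive their schema.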

Upvotes: 2
