Reputation: 6332
I have a simple piece of code:
test("Dataset as method") {
  val spark = SparkSession.builder().master("local").appName("Dataset as method").getOrCreate()
  import spark.implicits._
  // xyz is an alias of ds1
  val ds1 = Seq("1", "2").toDS().as("xyz")
  // xyz can be used to refer to the value column
  ds1.select($"xyz.value").show(truncate = false)
  // ERROR here: no table or view named xyz
  spark.sql("select * from xyz").show(truncate = false)
}
It looks to me as if xyz is a table name, but the SQL query select * from xyz raises an error complaining that xyz doesn't exist.
So I want to ask: what does the as() method really mean, and how should I use an alias like xyz in my case?
Upvotes: 0
Views: 73
Reputation: 41957
When called on a Dataset with a String argument (as in your case), .as() creates an alias for that Dataset, as you can see in the API source:
/**
* Returns a new Dataset with an alias set.
*
* @group typedrel
* @since 1.6.0
*/
def as(alias: String): Dataset[T] = withTypedPlan {
SubqueryAlias(alias, logicalPlan)
}
The alias can be used only in the functional Dataset APIs such as select, join and filter. It cannot be used in SQL queries, because setting an alias does not register a table or view with the session.
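To make the distinction concrete, here is a minimal sketch, assuming Spark 2.1 or later (for spark.catalog.tableExists) and a local session. The object name AliasIsNotATable and the method check are made up for illustration. It shows that the alias resolves through select, while the catalog knows nothing about it and the SQL query fails.

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Try

// Hypothetical helper object, not part of Spark.
object AliasIsNotATable {
  // Returns (is "xyz" in the catalog?, did the SQL query fail?)
  def check(spark: SparkSession): (Boolean, Boolean) = {
    import spark.implicits._
    val ds = Seq("1", "2").toDS().as("xyz")

    // The alias resolves in the functional API.
    ds.select($"xyz.value").show(truncate = false)

    // The alias is only part of the query plan; no table or view is registered.
    val inCatalog = spark.catalog.tableExists("xyz")
    // spark.sql analyzes the query eagerly, so the failure surfaces here.
    val sqlFails = Try(spark.sql("select * from xyz")).isFailure
    (inCatalog, sqlFails)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("alias-check").getOrCreate()
    println(check(spark))
    spark.stop()
  }
}
```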
The alias is easier to see in action if you create a two-column Dataset and alias it as you did:
val ds1 = Seq(("1", "2"),("3", "4")).toDS().as("xyz")
Now you can use select with the alias to pick a single column:
ds1.select($"xyz._1").show(truncate = false)
which should give you
+---+
|_1 |
+---+
|1 |
|3 |
+---+
The alias is most useful when you join two Datasets that have the same column names: you can write the join condition using the aliases to tell the columns apart.
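As an illustration of the join case, here is a small sketch. The data, the aliases l and r, the output column names, and the object name AliasJoin are all invented for this example. Both Datasets have columns named _1 and _2, so the aliases are what let the join condition and the select refer to each side unambiguously.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper object, not part of Spark.
object AliasJoin {
  def joinedRows(spark: SparkSession): Array[(String, String, String)] = {
    import spark.implicits._
    // Two Datasets with identical column names (_1, _2), each given an alias.
    val left = Seq(("1", "a"), ("2", "b")).toDS().as("l")
    val right = Seq(("1", "x"), ("3", "y")).toDS().as("r")

    // Without the aliases, $"_1" === $"_1" would be ambiguous.
    val joined = left
      .join(right, $"l._1" === $"r._1")
      .select($"l._1".as("key"), $"l._2".as("left_value"), $"r._2".as("right_value"))

    joined.collect().map(r => (r.getString(0), r.getString(1), r.getString(2)))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("alias-join").getOrCreate()
    joinedRows(spark).foreach(println)
    spark.stop()
  }
}
```

Only the key "1" appears on both sides, so the inner join keeps a single row.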
To use the name in SQL queries, however, you have to register the Dataset as a temporary table:
ds1.registerTempTable("xyz")
spark.sql("select * from xyz").show(truncate = false)
which should give you the expected result:
+---+---+
|_1 |_2 |
+---+---+
|1 |2 |
|3 |4 |
+---+---+
Or, better, use the newer API (registerTempTable has been deprecated since Spark 2.0):
ds1.createOrReplaceTempView("xyz")
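Putting it together, a minimal end-to-end sketch might look like the following. The object name TempViewQuery and the filter in the query are assumptions made for illustration. Once the view is registered, its name works in SQL for the lifetime of the session that created it.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper object, not part of Spark.
object TempViewQuery {
  def firstColumnWhereSecondIsFour(spark: SparkSession): Array[String] = {
    import spark.implicits._
    val ds1 = Seq(("1", "2"), ("3", "4")).toDS()

    // Register the Dataset under a name that SQL can resolve.
    ds1.createOrReplaceTempView("xyz")

    // The view name now works in SQL.
    spark.sql("select _1 from xyz where _2 = '4'")
      .collect()
      .map(_.getString(0))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("temp-view").getOrCreate()
    firstColumnWhereSecondIsFour(spark).foreach(println)
    spark.stop()
  }
}
```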
Upvotes: 2