Reputation: 1305
For a script that I am running, I have a bunch of chained views that look at a specific set of data in SQL (I am using Apache Spark SQL):
%sql
create view view_1 as
select column_1, column_2 from original_data_table
This logic culminates in view_n, as sketched below.
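For example, each view builds directly on the one before it (view_2 here is just illustrative):
%sql
create view view_2 as
select column_1, column_2 from view_1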
However, I then need to perform logic that is difficult (or impossible) to implement in SQL, specifically the explode command:
%python
from pyspark.sql.functions import explode, split

df_1 = sqlContext.sql("SELECT * FROM view_n")
df1_exploded = df_1.withColumn("exploded_column", explode(split(df_1.col_to_explode, ',')))
Is there a speed cost associated with switching between SQL tables and PySpark DataFrames? Or, since PySpark DataFrames are lazily evaluated, is it very similar to a view?
Is there a better way of switching from an SQL table to a PySpark DataFrame?
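For what it's worth, one way I can think of to compare the two (using the view_n and df1_exploded defined above) is to look at the physical plans, since both sides are lazily evaluated:
%python
# explain() prints the physical plan without triggering a job, so the plan
# behind the view can be compared with the plan behind the DataFrame.
sqlContext.sql("SELECT * FROM view_n").explain()
df1_exploded.explain()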
Upvotes: 3
Views: 23463
Reputation: 2767
You can use explode(), and just about anything else the DataFrame API has, via Spark SQL (see https://spark.apache.org/docs/latest/api/sql/index.html):
print(spark.version)
2.4.3
df = spark.createDataFrame([(1, [1, 2, 3]), (2, [4, 5, 6]), (3, [7, 8, 9])], ["id", "nest"])
df.printSchema()
root
|-- id: long (nullable = true)
|-- nest: array (nullable = true)
| |-- element: long (containsNull = true)
df.createOrReplaceTempView("sql_view")
spark.sql("SELECT id, explode(nest) as flatten FROM sql_view").show()
+---+-------+
| id|flatten|
+---+-------+
| 1| 1|
| 1| 2|
| 1| 3|
| 2| 4|
| 2| 5|
| 2| 6|
| 3| 7|
| 3| 8|
| 3| 9|
+---+-------+
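For comparison, here is the same explode done with the DataFrame API directly (a sketch reusing the df and sql_view from above); both routes go through the same Catalyst optimizer, so explain() should show the same physical plan either way:
from pyspark.sql.functions import explode

# Equivalent to the SQL version: explode() turns each array element into its own row.
df.select("id", explode("nest").alias("flatten")).show()

# Comparing plans: these should match, so the temp view itself adds no overhead.
df.select("id", explode("nest").alias("flatten")).explain()
spark.sql("SELECT id, explode(nest) AS flatten FROM sql_view").explain()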
Upvotes: 6