NewPy

Reputation: 663

Run SQL query on dataframe via Pyspark

I would like to run a SQL query on a DataFrame, but do I have to create a view on it first? Is there an easier way?

df = spark.createDataFrame([
    ('a', 1, 1), ('a', 1, None), ('b', 1, 1),
    ('c', 1, None), ('d', None, 1), ('e', 1, 1)
]).toDF('id', 'foo', 'bar')

I want to run some complex queries against this DataFrame. For example, I can do

df.createOrReplaceTempView("temp_view")
df_new = spark.sql("select id, max(foo) from temp_view group by id")

but do I have to register it as a view before querying it? I know there are DataFrame-equivalent operations; the query above is only an example.

Upvotes: 3

Views: 4584

Answers (1)

fuzzy-memory

Reputation: 195

You can just do

df.select('id', 'foo')

This returns a new Spark DataFrame with columns id and foo, without registering any view.

Upvotes: 2

Related Questions