user14237286

Reputation: 109

How Do I Programmatically Use "Count" In Pyspark?

I'm trying to do a simple count in PySpark programmatically, but I keep getting errors. .count() works at the end of the statement if I drop AS (count(city)), but I need the count to appear inside the query, not appended on the outside.

result = spark.sql("SELECT city AS (count(city)) AND business_id FROM business WHERE city = 'Reading'") 

One of many errors

Py4JJavaError: An error occurred while calling o24.sql.
: org.apache.spark.sql.catalyst.parser.ParseException: 
mismatched input '(' expecting ')'(line 1, pos 21)

== SQL ==
SELECT city AS (count(city)) AND business_id FROM business WHERE city = 'Reading'
---------------------^^^

Upvotes: 0

Views: 134

Answers (2)

user14237286

Reputation: 109

Just my own workaround for the problem I'm trying to solve. The other answer is where I would like to end up.

result = spark.sql("SELECT count(*) FROM business WHERE city='Reading'")

Upvotes: 0

mck

Reputation: 42422

Your syntax is incorrect. Maybe you want to do this instead:

result = spark.sql("""
    SELECT 
        count(city) over(partition by city), 
        business_id 
    FROM business 
    WHERE city = 'Reading'
""")

You need to provide a window if you use count without a group by; otherwise Spark cannot combine the aggregate with the non-aggregated business_id column. In this case, you probably want a count for each city.
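If the window semantics are hard to picture, here is a plain-Python sketch (with made-up sample rows) of what count(city) over(partition by city) produces: every row keeps its business_id and additionally carries the number of rows that share its city, rather than the rows being collapsed as a GROUP BY would do.

```python
from collections import Counter

# Hypothetical sample rows: (city, business_id)
rows = [("Reading", "b1"), ("Reading", "b2"), ("York", "b3")]

# The partition-by step: count rows per city
city_counts = Counter(city for city, _ in rows)

# Each row keeps its business_id and gains its partition's count
result = [(city_counts[city], business_id) for city, business_id in rows]
print(result)  # [(2, 'b1'), (2, 'b2'), (1, 'b3')]
```

Note that every "Reading" row gets the same count of 2; no rows are merged.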

Upvotes: 2
