user14237286

Reputation: 109

How Do I Programmatically Use "Count" In Pyspark?

I'm trying to do a simple count in PySpark programmatically, but I keep getting errors. .count() works at the end of the statement if I drop AS (count(city)), but I need the count to appear inside the query, not appended on the outside.

result = spark.sql("SELECT city AS (count(city)) AND business_id FROM business WHERE city = 'Reading'") 

One of many errors

Py4JJavaError: An error occurred while calling o24.sql.
: org.apache.spark.sql.catalyst.parser.ParseException: 
mismatched input '(' expecting ')'(line 1, pos 21)

== SQL ==
SELECT city AS (count(city)) AND business_id FROM business WHERE city = 'Reading'
---------------------^^^

Upvotes: 0

Views: 134

Answers (2)

user14237286

Reputation: 109

Just my own workaround for the problem I'm trying to solve. The other answer is where I would like to end up.

result = spark.sql("SELECT count(*) FROM business WHERE city='Reading'")

Upvotes: 0

mck

Reputation: 42422

Your syntax is incorrect. Maybe you want to do this instead:

result = spark.sql("""
    SELECT 
        count(city) over(partition by city), 
        business_id 
    FROM business 
    WHERE city = 'Reading'
""")

You need to provide a window if you use count without a group by; otherwise Spark cannot combine the aggregate with the non-aggregated business_id column. In this case, you probably want a count for each city.
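If the window semantics are hard to picture, here is a plain-Python sketch (with made-up sample rows) of what count(city) over(partition by city) produces: every row keeps its business_id and additionally carries the number of rows that share its city, rather than the rows being collapsed as a GROUP BY would do.

```python
from collections import Counter

# Hypothetical sample rows: (city, business_id)
rows = [("Reading", "b1"), ("Reading", "b2"), ("York", "b3")]

# The partition-by step: count rows per city
city_counts = Counter(city for city, _ in rows)

# Each row keeps its business_id and gains its partition's count
result = [(city_counts[city], business_id) for city, business_id in rows]
print(result)  # [(2, 'b1'), (2, 'b2'), (1, 'b3')]
```

Note that every "Reading" row gets the same count of 2; no rows are merged.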

Upvotes: 2
