user10910107
user10910107

Reputation: 3

Counting how many times each distinct value occurs in a column in PySparkSQL Join

I have used PySpark SQL to join together two tables, one containing crime location data with longitude and latitude and the other containing postcodes with their corresponding longitude and latitude.

What I am trying to work out is how to tally up how many crimes have occurred within each postcode. I am new to PySpark and my SQL is rusty so I am unsure where I am going wrong.

I have tried to use COUNT(DISTINCT) but that is simply giving me the total number of distinct postcodes.

mySchema = StructType([StructField("Longitude", StringType(),True), StructField("Latitude", StringType(),True)])

bgl_df = spark.createDataFrame(burglary_rdd, mySchema)

bgl_df.registerTempTable("bgl")

rdd2 = spark.sparkContext.textFile("posttrans.csv")

mySchema2 = StructType([StructField("Postcode", StringType(),True), StructField("Lon", StringType(),True), StructField("Lat", StringType(),True)])

pcode_df = spark.createDataFrame(pcode_rdd, mySchema2)

pcode_df.registerTempTable("pcode")

count = spark.sql("SELECT COUNT(DISTINCT pcode.Postcode) 
                   FROM pcode RIGHT JOIN bgl 
                   ON (bgl.Longitude = pcode.Lon 
                   AND bgl.Latitude = pcode.Lat)")

+------------------------+
|count(DISTINCT Postcode)|
+------------------------+
|                  523371|
+------------------------+

Instead I want something like:

+--------+---+
|Postcode|Num|
+--------+---+
|LN11 9DA| 2 |
|BN10 8JX| 5 |
| EN9 3YF| 9 |
|EN10 6SS| 1 |
+--------+---+

Upvotes: 0

Views: 96

Answers (1)

pissall
pissall

Reputation: 7399

You can do a groupby count to get a distinct count of values for a column:

group_df = df.groupby("Postcode").count()

You will get the ouput you want.

For an SQL query:

query =  """
         SELECT pcode.Postcode, COUNT(pcode.Postcode) AS Num
         FROM pcode 
         RIGHT JOIN bgl 
         ON (bgl.Longitude = pcode.Lon AND bgl.Latitude = pcode.Lat)
         GROUP BY pcode.Postcode
         """

count = spark.sql(query)

Also, I have copied in from your FROM and JOIN clause to make the query more relevant for copy-pasta.

Upvotes: 1

Related Questions