Reputation: 169
How can we find the number of words in a column of a spark dataframe without using REPLACE() function of SQL ? Below is the code and input I am working with but the replace() function does not work.
from pyspark.sql import SparkSession
my_spark = SparkSession \
.builder \
.appName("Python Spark SQL example") \
.enableHiveSupport() \
.getOrCreate()
parqFileName = 'gs://caserta-pyspark-eval/train.pqt'
tuesdayDF = my_spark.read.parquet(parqFileName)
tuesdayDF.createOrReplaceTempView("parquetFile")
tuesdaycrimes = spark.sql("SELECT LENGTH(Address) - LENGTH(REPLACE(Address, ' ', ''))+1 FROM parquetFile")
print(tuesdaycrimes.show())
+-------------------+--------------+--------------------+---------+----------+--------------+--------------------+-----------+---------+
| Dates| Category| Descript|DayOfWeek|PdDistrict| Resolution| Address| X| Y|
+-------------------+--------------+--------------------+---------+----------+--------------+--------------------+-----------+---------+
|2015-05-14 03:53:00| WARRANTS| WARRANT ARREST|Wednesday| NORTHERN|ARREST, BOOKED| OAK ST / LAGUNA ST| -122.42589|37.774597|
|2015-05-14 03:53:00|OTHER OFFENSES|TRAFFIC VIOLATION...|Wednesday| NORTHERN|ARREST, BOOKED| OAK ST / LAGUNA ST| -122.42589|37.774597|
|2015-05-14 03:33:00|OTHER OFFENSES|TRAFFIC VIOLATION...|Wednesday| NORTHERN|ARREST, BOOKED|VANNESS AV / GREE...| -122.42436|37.800415|
Upvotes: 15
Views: 52943
Reputation: 11694
Using Spark SQL
SELECT word, count(*)
FROM
(SELECT explode(split(Description, ' ')) AS word FROM mytable)
GROUP BY 1
ORDER BY 2 DESC
Full example
data = [
("2015-05-14 03:53:00", "WARRANT ARREST"),
("2015-05-14 03:53:00", "TRAFFIC VIOLATION"),
("2015-05-14 03:33:00", "TRAFFIC VIOLATION")
]
df = spark.createDataFrame(data, ["Dates", "Description"])
df.createOrReplaceTempView("mytable")
spark.sql("""
SELECT word, count(*)
FROM
(SELECT explode(split(Description, ' ')) AS word FROM mytable)
GROUP BY 1
ORDER BY 2 DESC
""").show()
Upvotes: 0
Reputation: 43494
There are number of ways to count the words using pyspark DataFrame functions, depending on what it is you are looking for.
Create Example Data
import pyspark.sql.functions as f
data = [
("2015-05-14 03:53:00", "WARRANT ARREST"),
("2015-05-14 03:53:00", "TRAFFIC VIOLATION"),
("2015-05-14 03:33:00", "TRAFFIC VIOLATION")
]
df = sqlCtx.createDataFrame(data, ["Dates", "Description"])
df.show()
In this example, we will count the words in the Description
column.
Count in each row
If you wanted the count of words in the specified column for each row you can create a new column using withColumn()
and do the following:
pyspark.sql.functions.split()
to break the string into a listpyspark.sql.functions.size()
to count the length of the listFor example:
df = df.withColumn('wordCount', f.size(f.split(f.col('Description'), ' ')))
df.show()
#+-------------------+-----------------+---------+
#| Dates| Description|wordCount|
#+-------------------+-----------------+---------+
#|2015-05-14 03:53:00| WARRANT ARREST| 2|
#|2015-05-14 03:53:00|TRAFFIC VIOLATION| 2|
#|2015-05-14 03:33:00|TRAFFIC VIOLATION| 2|
#+-------------------+-----------------+---------+
Sum word count over all rows
If you wanted to count the total number of words in the column across the entire DataFrame, you can use pyspark.sql.functions.sum()
:
df.select(f.sum('wordCount')).collect()
#[Row(sum(wordCount)=6)]
Count occurrence of each word
If you wanted the count of each word in the entire DataFrame, you can use split()
and pyspark.sql.function.explode()
followed by a groupBy
and count()
.
df.withColumn('word', f.explode(f.split(f.col('Description'), ' ')))\
.groupBy('word')\
.count()\
.sort('count', ascending=False)\
.show()
#+---------+-----+
#| word|count|
#+---------+-----+
#| TRAFFIC| 2|
#|VIOLATION| 2|
#| WARRANT| 1|
#| ARREST| 1|
#+---------+-----+
Upvotes: 44
Reputation: 4420
You can do it just using split
and size
of pyspark API
functions (Below is example):-
sqlContext.createDataFrame([['this is a sample address'],['another address']])\
.select(F.size(F.split(F.col("_1"), " "))).show()
Below is Output:-
+------------------+
|size(split(_1, ))|
+------------------+
| 5|
| 2|
+------------------+
Upvotes: 2
Reputation: 41957
You can define a udf
function as
def splitAndCountUdf(x):
return len(x.split(" "))
from pyspark.sql import functions as F
countWords = F.udf(splitAndCountUdf, 'int')
and call it using .withColumn
function as
tuesdayDF.withColumn("wordCount", countWords(tuesdayDF.address))
And if you want distinct count of words, you can change the udf
function to include set
as
def splitAndCountUdf(x):
return len(set(x.split(" ")))
from pyspark.sql import functions as F
countWords = F.udf(splitAndCountUdf, 'int')
Upvotes: 1