Abhishek

Reputation: 25

How to count occurrences of a list of words in a PySpark array column?

For each row, I want to count how many of the words in a given list occur in a PySpark array column.

    ls = ['aa', 'bb']

Given this DataFrame `df`:

| item |
|:---- |
| ['aa','bb'] |
| ['aa','bc'] |
| ['ad','bc'] |

the output should be:

| item| count|
|:---- |:------:| 
| ['aa','bb']| 2| 
| ['aa','bc']| 1| 
| ['ad','bc']| 0| 

Upvotes: 1

Views: 356

Answers (1)

过过招

Reputation: 4189

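This can be achieved by slightly modifying the definition of the `ls` variable: wrap the list in a one-row DataFrame, join it to every row of the data, and take the size of the array intersection.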

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # The word list, wrapped as a one-row DataFrame with a single array column.
    ls = [
        (['aa', 'bb'],),
    ]
    data = [
        (['aa', 'bb'],),
        (['aa', 'bc'],),
        (['ad', 'bc'],)
    ]
    df1 = spark.createDataFrame(ls, ['words'])
    df2 = spark.createDataFrame(data, ['word_arr'])

    # Attach the word list to every row, then count the distinct words
    # that each array shares with it.
    df = df2.crossJoin(df1).select(
        'word_arr',
        F.size(F.array_intersect('word_arr', 'words')).alias('count'),
    )
    df.show(truncate=False)

Upvotes: 2
