Leila

Reputation: 11

pyspark - count list element and make columns by the element frequency

I want to make a table from a dataframe using the frequency of each element in the list column.

Example:
As Is

+----+---------------+
| id |     data      |            
+----+---------------+
|a   |[1,2,3,4,5]    |
|b   |[2,2,4,5]      |
|c   |[56,7,1,1,1]   |
+----+---------------+

To Be

+----+-----+-----+-----+-----+-----+-----+-----+
| id | 1   | 2   | 3   |  4  |  5  |  7  |  56 |          
+----+-----+-----+-----+-----+-----+-----+-----+
|a   | 1   | 1   | 1   |  1  |  1  |  0  |  0  |    
|b   | 0   | 2   | 0   |  1  |  1  |  0  |  0  |    
|c   | 3   | 0   | 0   |  0  |  0  |  1  |  1  |    
+----+-----+-----+-----+-----+-----+-----+-----+

How can I transform "As Is" into "To Be"?

Upvotes: 0

Views: 186

Answers (1)

samkart

Reputation: 6644

A possible way is to explode the arrays and then pivot the exploded values.

# functions import needed for func.explode below
from pyspark.sql import functions as func

# input data
data_sdf.show()

# +---+----------------+
# | id|            data|
# +---+----------------+
# |  a| [1, 2, 3, 4, 5]|
# |  b|    [2, 2, 4, 5]|
# |  c|[56, 7, 1, 1, 1]|
# +---+----------------+
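
If you need to recreate this input, a minimal sketch (assuming an active SparkSession named spark):

# build the example dataframe; column names follow the question
data_sdf = spark.createDataFrame(
    [('a', [1, 2, 3, 4, 5]),
     ('b', [2, 2, 4, 5]),
     ('c', [56, 7, 1, 1, 1])],
    ['id', 'data']
)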

data_sdf. \
    withColumn('data_explode', func.explode('data')). \
    groupBy('id'). \
    pivot('data_explode'). \
    count(). \
    fillna(0). \
    show()

# +---+---+---+---+---+---+---+---+
# | id|  1|  2|  3|  4|  5|  7| 56|
# +---+---+---+---+---+---+---+---+
# |  c|  3|  0|  0|  0|  0|  1|  1|
# |  b|  0|  2|  0|  1|  1|  0|  0|
# |  a|  1|  1|  1|  1|  1|  0|  0|
# +---+---+---+---+---+---+---+---+
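
Note that pivot without an explicit value list makes Spark run an extra job to collect the distinct exploded values first. If you already know the possible values, you can pass them to pivot to skip that step (a sketch, using the values from the example above):

data_sdf. \
    withColumn('data_explode', func.explode('data')). \
    groupBy('id'). \
    pivot('data_explode', [1, 2, 3, 4, 5, 7, 56]). \
    count(). \
    fillna(0). \
    show()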

Upvotes: 1
