n0obcoder

Reputation: 737

How to count and store frequency of items in a column of a PySpark dataframe?

I have the following dataset:

from pyspark.sql import SparkSession

# get (or create) a SparkSession; it is already available as `spark` in a Spark shell
spark = SparkSession.builder.getOrCreate()

simpleData = [
    ("person1", "city1"),
    ("person1", "city2"),
    ("person1", "city1"),
    ("person1", "city3"),
    ("person1", "city1"),
    ("person2", "city3"),
    ("person2", "city2"),
    ("person2", "city3"),
    ("person2", "city3"),
]
columns = ["persons_name", "city_visited"]
exp = spark.createDataFrame(data=simpleData, schema=columns)

exp.printSchema()
exp.show()

which looks like this:

root
 |-- persons_name: string (nullable = true)
 |-- city_visited: string (nullable = true)

+------------+------------+
|persons_name|city_visited|
+------------+------------+
|     person1|       city1|
|     person1|       city2|
|     person1|       city1|
|     person1|       city3|
|     person1|       city1|
|     person2|       city3|
|     person2|       city2|
|     person2|       city3|
|     person2|       city3|
+------------+------------+

Now I want to create n new columns, where n is the number of unique values in the 'city_visited' column, so that each new column holds the frequency of that value for every person. The output should look something like this:

+------------+-----+-----+-----+
|persons_name|city1|city2|city3|
+------------+-----+-----+-----+
|     person1|    3|    1|    1|
|     person2|    0|    1|    3|
+------------+-----+-----+-----+

How can I achieve this?

Upvotes: 0

Views: 632

Answers (1)

mck

Reputation: 42352

Use pivot after groupBy:

exp.groupBy('persons_name').pivot('city_visited').count()
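For the sample data in the question, this should produce something like the following (row order is not guaranteed until you sort, and person2 gets null for city1 because that combination never occurs):

+------------+-----+-----+-----+
|persons_name|city1|city2|city3|
+------------+-----+-----+-----+
|     person1|    3|    1|    1|
|     person2| null|    1|    3|
+------------+-----+-----+-----+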

If you want 0 instead of null:

exp.groupBy('persons_name').pivot('city_visited').count().fillna(0)

And if you want to sort the result by persons_name, append .orderBy('persons_name') to either query.
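Putting the pieces together, a minimal sketch against the exp dataframe defined in the question:

result = (exp.groupBy('persons_name')   # one output row per person
    .pivot('city_visited')              # one output column per distinct city
    .count()                            # frequency of each (person, city) pair
    .fillna(0)                          # replace null counts with 0
    .orderBy('persons_name'))           # deterministic row order
result.show()

This should print the table requested in the question:

+------------+-----+-----+-----+
|persons_name|city1|city2|city3|
+------------+-----+-----+-----+
|     person1|    3|    1|    1|
|     person2|    0|    1|    3|
+------------+-----+-----+-----+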

Upvotes: 1
