Reputation: 737
I have a dataset
simpleData = [("person1","city1"), \
("person1","city2"), \
("person1","city1"), \
("person1","city3"), \
("person1","city1"), \
("person2","city3"), \
("person2","city2"), \
("person2","city3"), \
("person2","city3") \
]
columns= ["persons_name","city_visited"]
exp = spark.createDataFrame(data = simpleData, schema = columns)
exp.printSchema()
exp.show()
which looks like this:
root
|-- persons_name: string (nullable = true)
|-- city_visited: string (nullable = true)
+------------+------------+
|persons_name|city_visited|
+------------+------------+
| person1| city1|
| person1| city2|
| person1| city1|
| person1| city3|
| person1| city1|
| person2| city3|
| person2| city2|
| person2| city3|
| person2| city3|
+------------+------------+
Now I want to create n new columns, where n is the number of unique values in the 'city_visited' column, so that each column holds how many times each person visited that city. The output should look something like this:
+------------+-----+-----+-----+
|persons_name|city1|city2|city3|
+------------+-----+-----+-----+
| person1| 3| 1| 1|
| person2| 0| 1| 3|
+------------+-----+-----+-----+
How can I achieve this?
Upvotes: 0
Views: 632
Reputation: 42352
Use pivot after groupBy:
exp.groupBy('persons_name').pivot('city_visited').count()
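On the sample data above, person2 never visits city1, so that cell comes back as null rather than 0; the result should look roughly like this (row order may vary):
+------------+-----+-----+-----+
|persons_name|city1|city2|city3|
+------------+-----+-----+-----+
|     person1|    3|    1|    1|
|     person2| null|    1|    3|
+------------+-----+-----+-----+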
If you want 0 instead of null:
exp.groupBy('persons_name').pivot('city_visited').count().fillna(0)
And if you want to sort by persons_name, append .orderBy('persons_name') to the queries.
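Putting it together, here is a minimal self-contained sketch (assuming a local SparkSession; the data and column names are taken from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

simpleData = [("person1", "city1"), ("person1", "city2"), ("person1", "city1"),
              ("person1", "city3"), ("person1", "city1"), ("person2", "city3"),
              ("person2", "city2"), ("person2", "city3"), ("person2", "city3")]
columns = ["persons_name", "city_visited"]
exp = spark.createDataFrame(data=simpleData, schema=columns)

# one column per distinct city, counting visits per person;
# fillna(0) turns missing person/city combinations into 0,
# orderBy sorts the rows by persons_name
result = (exp.groupBy("persons_name")
             .pivot("city_visited")
             .count()
             .fillna(0)
             .orderBy("persons_name"))
result.show()

This should reproduce the desired table shown in the question.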
Upvotes: 1