I have pandas and PySpark DataFrames that contain a list of bigrams per row, on a per-day basis. I want to break up the lists and move each bigram to its own row, with counts in descending order.
Shown below is the DataFrame I have, sorted by event_dt. The column 'merged' holds the lists of bigrams; for example, 'nice meet' and 'meet postpaid' are two bigrams. The list of bigrams continues for each day...
Now I want to move each bigram to a new row for the same date. For example, 'nice meet' will be in one row and 'meet postpaid' in another, while the event_dt column shows the same date. I also want the bigram counts in descending order, and the top 10 or top 20 bigrams by count for each date. Here is what I want it to look like:
As you can see, each bigram sits on its own row with its count and the same date. The first row shows the bigram 'customer care' because this bigram occurs 20 times down the line, which is the count PySpark/pandas should compute (this is the part I need help with!). Appreciate the help in advance!
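For reference, here is a toy version of the input (the dates and bigrams below are made up; only the event_dt and merged column names match my actual data):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One row per day, with a list of bigrams in the 'merged' column
df = spark.createDataFrame(
    [
        ('2019-05-01', ['nice meet', 'meet postpaid', 'customer care']),
        ('2019-05-02', ['customer care', 'customer care', 'nice meet']),
    ],
    ['event_dt', 'merged'],
)
df.show(truncate=False)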
I think you need something like this:
from pyspark.sql.functions import explode

# Explode each list of bigrams into one row per bigram,
# then count occurrences per (event_dt, bigram) pair
df_res = (df
          .select('event_dt', explode('merged').alias('bigram'))
          .groupBy('event_dt', 'bigram')
          .count()
          )
df_res.show()
Notice that with this code, if you have, say, "customer care" on two different days, my version counts them separately and you'll get a row for each day it appeared.
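Since you also asked for the counts in descending order, you can simply sort the result above:

import pyspark.sql.functions as f

# Show all (event_dt, bigram) counts, largest first
df_res.orderBy(f.desc('count')).show()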
UPDATE
If you want to see the top n from each group, you should do as follows:
import pyspark.sql.functions as f
from pyspark.sql.window import Window

n = 10

# Rank bigrams within each date by their count, highest first
win = Window.partitionBy('event_dt').orderBy(f.desc('count'))

(df_res
 .withColumn('rank', f.row_number().over(win))
 .filter(f.col('rank') <= n)
 .show()
)
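Since you mentioned you also have a pandas DataFrame, here is a sketch of the same logic in pandas (assuming the same event_dt and merged column names, and pandas >= 0.25 for explode; pdf stands in for your pandas DataFrame):

import pandas as pd

n = 10

# Explode the lists so each bigram gets its own row,
# then count occurrences per (event_dt, bigram) pair
df_res = (pdf.explode('merged')
             .rename(columns={'merged': 'bigram'})
             .groupby(['event_dt', 'bigram'])
             .size()
             .reset_index(name='count'))

# Keep the top n bigrams per date, counts descending
top_n = (df_res.sort_values(['event_dt', 'count'], ascending=[True, False])
               .groupby('event_dt')
               .head(n))
print(top_n)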