I have pandas and PySpark DataFrames that contain a list of bigrams per row, on a per-day basis. I want to break up the lists and move each bigram to its own row, with counts in descending order.
Shown below is the DataFrame I have, sorted by event_dt. The column 'merged' holds the lists of bigrams; for example, 'nice meet' and 'meet postpaid' are two bigrams. The list of bigrams continues for each day...
Now I want to move each bigram to a new row for the same date. For example, 'nice meet' will be in one row and 'meet postpaid' in another, while the event_dt column shows the same date. I also want the bigram counts in descending order, and the top 10 or top 20 bigrams by count for each date. Here is what I want it to look like:
As you can see, each bigram sits on its own row with its count and the same date. The first row shows the bigram 'customer care' because this bigram occurs 20 times down the line, which is the count PySpark/pandas should compute (this is the part I need help with!). Appreciate the help in advance!
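For reference, here is a toy version of the input (the dates and bigrams below are made up; only the event_dt and merged column names match my actual data):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One row per day, with a list of bigrams in the 'merged' column
df = spark.createDataFrame(
    [
        ('2019-05-01', ['nice meet', 'meet postpaid', 'customer care']),
        ('2019-05-02', ['customer care', 'customer care', 'nice meet']),
    ],
    ['event_dt', 'merged'],
)
df.show(truncate=False)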
I think you need something like this:
from pyspark.sql.functions import explode

# Explode each list of bigrams into one row per bigram,
# then count occurrences per (event_dt, bigram) pair
df_res = (df
          .select('event_dt', explode('merged').alias('bigram'))
          .groupBy('event_dt', 'bigram')
          .count()
          )
df_res.show()
Notice that with this code, if you have, say, "customer care" on two different days, my version counts them separately and you'll get a row for each day it appeared.
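Since you also asked for the counts in descending order, you can simply sort the result above:

import pyspark.sql.functions as f

# Show all (event_dt, bigram) counts, largest first
df_res.orderBy(f.desc('count')).show()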
UPDATE
If you want to see the top n from each group, you should do as follows:
import pyspark.sql.functions as f
from pyspark.sql.window import Window

n = 10

# Rank bigrams within each date by their count, highest first
win = Window.partitionBy('event_dt').orderBy(f.desc('count'))

(df_res
 .withColumn('rank', f.row_number().over(win))
 .filter(f.col('rank') <= n)
 .show()
)
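Since you mentioned you also have a pandas DataFrame, here is a sketch of the same logic in pandas (assuming the same event_dt and merged column names, and pandas >= 0.25 for explode; pdf stands in for your pandas DataFrame):

import pandas as pd

n = 10

# Explode the lists so each bigram gets its own row,
# then count occurrences per (event_dt, bigram) pair
df_res = (pdf.explode('merged')
             .rename(columns={'merged': 'bigram'})
             .groupby(['event_dt', 'bigram'])
             .size()
             .reset_index(name='count'))

# Keep the top n bigrams per date, counts descending
top_n = (df_res.sort_values(['event_dt', 'count'], ascending=[True, False])
               .groupby('event_dt')
               .head(n))
print(top_n)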