Himberjack

Reputation: 5792

PySpark computation of weekly occurrences

I am trying to compute the weekly occurrences of a word, that is, whether each word is more frequent this week than in the previous week. I am kind of stuck. I did the following:

from dateutil.parser import parse

m = sc.parallelize(["oded,12-12-2018", "oded,12-03-2018", "oded,12-12-2018",
                    "oded,12-06-2018", "oded2,12-02-2018", "oded2,12-02-2018"])
m = m.map(lambda line: line.split(','))
weekly = m.map(lambda line: (line[0], parse(line[1]).strftime("%V%y")))
s = sql.createDataFrame(weekly)
s.groupby("_1", "_2").count().sort("_2").show()

The result is:

+-----+----+-----+
|   _1|  _2|count|
+-----+----+-----+
|oded2|4818|    2|
| oded|4918|    2|
| oded|5018|    2|
+-----+----+-----+

How can I get the week-over-week difference, i.e. oded: 0 = (2 - 2) and oded2: 2 = (2 - 0)?

Upvotes: 1

Views: 67

Answers (1)

zlidime

Reputation: 1224

Hi, you can use the lag window function to find the value from the previous week, after you have counted words per week. For weeks that have no previous value, the count will be filled with zero (via na.fill(0)); alternatively, you can use na.drop() to remove those rows completely.

from pyspark.sql.functions import lag, col
from pyspark.sql.window import Window

# for each word, look at the previous week's count; missing values become 0
w = Window.partitionBy("_1").orderBy(col("_2"))
s.select("*", lag("count").over(w).alias("prev_week")).na.fill(0).show()
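If you also want the difference asked for in the question (this week's count minus the previous week's count), one possible sketch, assuming s is the grouped DataFrame from the question, is to subtract the lagged column and default the missing first week to 0 with coalesce:

from pyspark.sql.functions import lag, col, coalesce, lit
from pyspark.sql.window import Window

w = Window.partitionBy("_1").orderBy(col("_2"))

# previous week's count per word, 0 when there is no previous week
prev = coalesce(lag("count").over(w), lit(0))

# diff = this week's count minus last week's count
s.withColumn("prev_week", prev) \
 .withColumn("diff", col("count") - col("prev_week")) \
 .show()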

Upvotes: 2
