Himberjack

Reputation: 5792

PySpark computation of weekly occurrences

I am trying to compute the weekly occurrences of a word, that is, whether each word is more frequent this week than in the previous week. I am kind of stuck. I did the following:

from dateutil.parser import parse

m = sc.parallelize(["oded,12-12-2018", "oded,12-03-2018", "oded,12-12-2018",
                    "oded,12-06-2018", "oded2,12-02-2018", "oded2,12-02-2018"])
m = m.map(lambda line: line.split(','))
weekly = m.map(lambda line: (line[0], parse(line[1]).strftime("%V%y")))
s = sql.createDataFrame(weekly)
s.groupby("_1", "_2").count().sort("_2").show()

The result is:

+-----+----+-----+
|   _1|  _2|count|
+-----+----+-----+
|oded2|4818|    2|
| oded|4918|    2|
| oded|5018|    2|
+-----+----+-----+

How can I get the week-over-week difference, i.e. oded: 0 = (2 - 2) and oded2: 2 = (2 - 0)?

Upvotes: 1

Views: 67

Answers (1)

zlidime

Reputation: 1224

Hi, you can use the lag window function to find the value from the previous week, after you have counted words per week. For weeks that have no previous value, the count will be filled with zero (via na.fill(0)); alternatively, you can use na.drop() to remove those rows completely.

from pyspark.sql.functions import lag, col
from pyspark.sql.window import Window

# for each word, look at the previous week's count; missing values become 0
w = Window.partitionBy("_1").orderBy(col("_2"))
s.select("*", lag("count").over(w).alias("prev_week")).na.fill(0).show()
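If you also want the difference asked for in the question (this week's count minus the previous week's count), one possible sketch, assuming s is the grouped DataFrame from the question, is to subtract the lagged column and default the missing first week to 0 with coalesce:

from pyspark.sql.functions import lag, col, coalesce, lit
from pyspark.sql.window import Window

w = Window.partitionBy("_1").orderBy(col("_2"))

# previous week's count per word, 0 when there is no previous week
prev = coalesce(lag("count").over(w), lit(0))

# diff = this week's count minus last week's count
s.withColumn("prev_week", prev) \
 .withColumn("diff", col("count") - col("prev_week")) \
 .show()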

Upvotes: 2
