Reputation: 5792
I am trying to compute the weekly occurrences of each word, that is, whether a word is more frequent this week than in the previous week. I am stuck on that last step. So far I have done the following:
from dateutil.parser import parse

m = sc.parallelize(["oded,12-12-2018", "oded,12-03-2018", "oded,12-12-2018", "oded,12-06-2018", "oded2,12-02-2018", "oded2,12-02-2018"])
m = m.map(lambda line: line.split(','))
# key each record by (word, ISO week number + two-digit year), e.g. "4918" = week 49 of 2018
weekly = m.map(lambda line: (line[0], parse(line[1]).strftime("%V%y")))
s = sql.createDataFrame(weekly)
s.groupby("_1", "_2").count().sort("_2").show()
The result is:
+-----+----+-----+
| _1| _2|count|
+-----+----+-----+
|oded2|4818| 2|
| oded|4918| 2|
| oded|5018| 2|
+-----+----+-----+
How can I compute the difference from the previous week for each word, i.e. oded: 0 = (2 - 2) and oded2: 2 = (2 - 0)?
Upvotes: 1
Views: 67
Reputation: 1224
You can use the lag window function to look up the previous week's value, after you count the words per week. For weeks that have no previous week, lag returns null; you can fill it with zero (na.fill(0)) or remove those rows entirely with na.drop().
from pyspark.sql.functions import lag, col
from pyspark.sql.window import Window

# s is the per-word, per-week count DataFrame, i.e. s.groupby("_1", "_2").count()
w = Window().partitionBy("_1").orderBy(col("_2"))
# lag("count") pulls the previous week's count for the same word; fill missing ones with 0
s.select("*", lag("count").over(w).alias("prev_week")).na.fill(0).show()
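To get the week-over-week difference the question asks for, subtract the lagged count from the current count. A minimal sketch building on the counted DataFrame s and the same window as above (prev and diff are just illustrative names):

from pyspark.sql.functions import lag, col, coalesce, lit
from pyspark.sql.window import Window

# same window as above: partition by word, order by week
w = Window.partitionBy("_1").orderBy(col("_2"))

# previous week's count, defaulting to 0 when the word has no earlier week
prev = coalesce(lag("count").over(w), lit(0))

s.withColumn("diff", col("count") - prev).sort("_2").show()

With the data from the question this gives diff = 2 for oded2 in week 4818, 2 for oded in week 4918, and 0 for oded in week 5018, which matches the 0 = (2 - 2) and 2 = (2 - 0) asked for.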
Upvotes: 2