knagesh

Reputation: 51

How to fit a kernel density estimate on a PySpark dataframe column and use it to create a new column with the estimates

My use case is the following. Consider I have a PySpark dataframe with the following columns:

1. hh: the hour of the day (type int)
2. userId: some unique identifier

What I want to do is figure out the list of userIds which have anomalous hits onto the page. So I first do a groupby:

df = df.groupBy("hh", "userId").count().withColumnRenamed("count", "LoginCounts")

(withColumnRenamed is used because alias would rename the dataframe rather than the generated count column.)

Now the dataframe has the columns:

1. hh
2. userId
3. LoginCounts: the number of times a specific user logs in at a particular hour
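For concreteness, here is a minimal sketch of this setup (the sample values are made up for illustration, and spark is assumed to be an active SparkSession):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# made-up illustration data: (hour, userId) pairs
df = spark.createDataFrame(
    [(13, "u1"), (13, "u1"), (13, "u2"), (14, "u1")],
    ["hh", "userId"],
)

df = df.groupBy("hh", "userId").count().withColumnRenamed("count", "LoginCounts")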

I want to use the PySpark KernelDensity class as follows:

from pyspark.mllib.stat import KernelDensity
kd = KernelDensity()
kd.setSample(df.select("LoginCounts").rdd)
kd.estimate([13.0, 14.0])

I get the error:

Py4JJavaError: An error occurred while calling o647.estimateKernelDensity. : org.apache.spark.SparkException: Job aborted due to stage failure

Now my end goal is to fit a KDE on, say, one day's hourly data and then use the next day's data to get the probability estimates for each login count. E.g. I would like to achieve something of this nature:

df.withColumn("kdeProbs", kde.estimate(col("LoginCounts")))

So the column kdeProbs will contain P(LoginCount=x | estimated kde).

I have tried searching for an example of this but am always redirected to the standard KDE example on the spark.apache.org page, which does not solve my case.

Upvotes: 2

Views: 3291

Answers (1)

user2739472

Reputation: 1509

Selecting one column and taking .rdd is not enough: that gives you an RDD of Row objects, whereas setSample needs the actual numeric values, so you have to extract the value from each Row. Try this:

from pyspark.mllib.stat import KernelDensity

dat_rdd = df.select("LoginCounts").rdd

# extract the value from each Row object
dat_rdd_data = dat_rdd.map(lambda x: x[0])

kd = KernelDensity()
kd.setSample(dat_rdd_data)
kd.estimate([13.0, 14.0])
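One caveat, as a hedged note: count() produces integer values while the underlying Scala estimator works on doubles, so if you hit a ClassCastException it may help to cast explicitly when extracting the values:

# hedged precaution: cast to float, since the Scala backend expects doubles
dat_rdd_data = dat_rdd.map(lambda x: float(x[0]))

Also, estimate() returns a numpy array with the estimated density at each requested point, and it runs on the driver, so it cannot be called inside withColumn directly. One way to get the kdeProbs column the question asks about (a sketch, assuming day1_df and day2_df are the two days' aggregated dataframes and spark is the active SparkSession) is to score each distinct count once on the driver and join the densities back:

# fit on day 1's login counts
kd = KernelDensity()
kd.setSample(day1_df.select("LoginCounts").rdd.map(lambda x: float(x[0])))

# evaluate the KDE once per distinct day-2 count, on the driver
distinct_counts = [row[0] for row in day2_df.select("LoginCounts").distinct().collect()]
densities = kd.estimate([float(c) for c in distinct_counts])

# join the density estimates back as a new column
probs_df = spark.createDataFrame(
    [(c, float(d)) for c, d in zip(distinct_counts, densities)],
    ["LoginCounts", "kdeProbs"],
)
day2_scored = day2_df.join(probs_df, on="LoginCounts", how="left")

Note that these are density values rather than true probabilities, since the KDE estimates a continuous density over the counts.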

Upvotes: 0
