Calling .agg on udf function throws error

Question

I'm trying to apply LinearRegression on a set up bins that have been generated. The DataFrame that contains the bin looks like currently looks like DataFrame[features: vector, trip_duration: int, prediction: double]. The bin is labeled prediction. Currently, my code looks like this

    predictions = crossval.fit(trainingData).transform(trainingData)
    ''' 
    DataFrame[features: vector, trip_duration: int, prediction: double]
    '''
    transform_udf = udf(lambda x: vecAssembler.transform(x))
    bins = predictions.groupBy("prediction").agg(transform_udf(predictions.features)).show()

However when I run this code I get the following error:

Traceback (most recent call last):
  File "/opt/spark/python/pyspark/serializers.py", line 590, in dumps
    return cloudpickle.dumps(obj, 2)
  File "/opt/spark/python/pyspark/cloudpickle.py", line 863, in dumps
    cp.dump(obj)
  File "/opt/spark/python/pyspark/cloudpickle.py", line 260, in dump
    return Pickler.dump(self, obj)
  File "/usr/lib/python2.7/pickle.py", line 224, in dump
    self.save(obj)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python2.7/pickle.py", line 554, in save_tuple
    save(element)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/opt/spark/python/pyspark/cloudpickle.py", line 400, in save_function
    self.save_function_tuple(obj)
  File "/opt/spark/python/pyspark/cloudpickle.py", line 549, in save_function_tuple
    save(state)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python2.7/pickle.py", line 655, in save_dict
    self._batch_setitems(obj.iteritems())
  File "/usr/lib/python2.7/pickle.py", line 687, in _batch_setitems
    save(v)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python2.7/pickle.py", line 606, in save_list
    self._batch_appends(iter(obj))
  File "/usr/lib/python2.7/pickle.py", line 642, in _batch_appends
    save(tmp[0])
  File "/usr/lib/python2.7/pickle.py", line 331, in save
    self.save_reduce(obj=obj, *rv)
  File "/usr/lib/python2.7/pickle.py", line 425, in save_reduce
    save(state)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python2.7/pickle.py", line 655, in save_dict
    self._batch_setitems(obj.iteritems())
  File "/usr/lib/python2.7/pickle.py", line 687, in _batch_setitems
    save(v)
  File "/usr/lib/python2.7/pickle.py", line 306, in save
    rv = reduce(self.proto)
  File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/opt/spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 332, in get_return_value
    format(target_id, ".", name, value))
Py4JError: An error occurred while calling o163.__getnewargs__. Trace:
py4j.Py4JException: Method __getnewargs__([]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
    at py4j.Gateway.invoke(Gateway.java:274)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)


Traceback (most recent call last):
  File "part2.py", line 118, in 
    main()
  File "part2.py", line 106, in main
    bins = predictions.groupBy("prediction").agg(transform_udf(predictions.features)).show()
  File "/opt/spark/python/pyspark/sql/udf.py", line 189, in wrapper
    return self(*args)
  File "/opt/spark/python/pyspark/sql/udf.py", line 167, in __call__
    judf = self._judf
  File "/opt/spark/python/pyspark/sql/udf.py", line 151, in _judf
    self._judf_placeholder = self._create_judf()
  File "/opt/spark/python/pyspark/sql/udf.py", line 160, in _create_judf
    wrapped_func = _wrap_function(sc, self.func, self.returnType)
  File "/opt/spark/python/pyspark/sql/udf.py", line 35, in _wrap_function
    pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
  File "/opt/spark/python/pyspark/rdd.py", line 2420, in _prepare_for_python_RDD
    pickled_command = ser.dumps(command)
  File "/opt/spark/python/pyspark/serializers.py", line 600, in dumps
    raise pickle.PicklingError(msg)
cPickle.PicklingError: Could not serialize object: Py4JError: An error occurred while calling o163.__getnewargs__. Trace:
py4j.Py4JException: Method __getnewargs__([]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
    at py4j.Gateway.invoke(Gateway.java:274)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

How do I apply the linear regression model on data that has a particular prediction? Note that I'm trying to apply the linear regression model for all of the data, grouped according to the prediction. So I want to run lrm on:

[row 6 - prediction 1,
row 4 - prediction 1,
row 8 - prediction 1]

[row 2 - prediction 2,
row 5 - prediction 2,
row 1 - prediction 2,
row 7 - prediction 2]

[row 3 - prediction 3]

Without using pandas.

QuantStats · Accepted Answer

Conveniently, for a linear regression of the form,

$y = X\beta+\varepsilon,$

with the standard ordinary least squares assumptions, the estimated parameters, $\hat{\beta}$ have an analytical solution as follows.

$\:\hat{\beta} = (X^{T}X)^{-1}X^{T}y$

X is your features, y is your label and subscript T and -1 are the matrix transpose and matrix inverse respectively.

You can write a pandas_udf to compute your linear regression parameters with the formula above and apply it after groupBy. Note that standard udf which you are now using won't work with groupBy.

Calling .agg on udf function throws error

Answers (1)

Related Questions