Reputation: 103
I'm trying to apply LinearRegression to a set of bins that have been generated. The DataFrame that contains the bins currently looks like DataFrame[features: vector, trip_duration: int, prediction: double]. The bin label is the prediction column. Currently, my code looks like this:
predictions = crossval.fit(trainingData).transform(trainingData)
'''
DataFrame[features: vector, trip_duration: int, prediction: double]
'''
transform_udf = udf(lambda x: vecAssembler.transform(x))
bins = predictions.groupBy("prediction").agg(transform_udf(predictions.features)).show()
However, when I run this code I get the following error:
Traceback (most recent call last):
File "/opt/spark/python/pyspark/serializers.py", line 590, in dumps
return cloudpickle.dumps(obj, 2)
File "/opt/spark/python/pyspark/cloudpickle.py", line 863, in dumps
cp.dump(obj)
File "/opt/spark/python/pyspark/cloudpickle.py", line 260, in dump
return Pickler.dump(self, obj)
File "/usr/lib/python2.7/pickle.py", line 224, in dump
self.save(obj)
File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/usr/lib/python2.7/pickle.py", line 554, in save_tuple
save(element)
File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/opt/spark/python/pyspark/cloudpickle.py", line 400, in save_function
self.save_function_tuple(obj)
File "/opt/spark/python/pyspark/cloudpickle.py", line 549, in save_function_tuple
save(state)
File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/usr/lib/python2.7/pickle.py", line 655, in save_dict
self._batch_setitems(obj.iteritems())
File "/usr/lib/python2.7/pickle.py", line 687, in _batch_setitems
save(v)
File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/usr/lib/python2.7/pickle.py", line 606, in save_list
self._batch_appends(iter(obj))
File "/usr/lib/python2.7/pickle.py", line 642, in _batch_appends
save(tmp[0])
File "/usr/lib/python2.7/pickle.py", line 331, in save
self.save_reduce(obj=obj, *rv)
File "/usr/lib/python2.7/pickle.py", line 425, in save_reduce
save(state)
File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/usr/lib/python2.7/pickle.py", line 655, in save_dict
self._batch_setitems(obj.iteritems())
File "/usr/lib/python2.7/pickle.py", line 687, in _batch_setitems
save(v)
File "/usr/lib/python2.7/pickle.py", line 306, in save
rv = reduce(self.proto)
File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/opt/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 332, in get_return_value
format(target_id, ".", name, value))
Py4JError: An error occurred while calling o163.__getnewargs__. Trace:
py4j.Py4JException: Method __getnewargs__([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:274)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Traceback (most recent call last):
File "part2.py", line 118, in <module>
main()
File "part2.py", line 106, in main
bins = predictions.groupBy("prediction").agg(transform_udf(predictions.features)).show()
File "/opt/spark/python/pyspark/sql/udf.py", line 189, in wrapper
return self(*args)
File "/opt/spark/python/pyspark/sql/udf.py", line 167, in __call__
judf = self._judf
File "/opt/spark/python/pyspark/sql/udf.py", line 151, in _judf
self._judf_placeholder = self._create_judf()
File "/opt/spark/python/pyspark/sql/udf.py", line 160, in _create_judf
wrapped_func = _wrap_function(sc, self.func, self.returnType)
File "/opt/spark/python/pyspark/sql/udf.py", line 35, in _wrap_function
pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
File "/opt/spark/python/pyspark/rdd.py", line 2420, in _prepare_for_python_RDD
pickled_command = ser.dumps(command)
File "/opt/spark/python/pyspark/serializers.py", line 600, in dumps
raise pickle.PicklingError(msg)
cPickle.PicklingError: Could not serialize object: Py4JError: An error occurred while calling o163.__getnewargs__. Trace:
py4j.Py4JException: Method __getnewargs__([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:274)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
How do I apply the linear regression model on data that has a particular prediction? Note that I'm trying to apply the linear regression model for all of the data, grouped according to the prediction. So I want to run lrm on:
[row 6 - prediction 1,
row 4 - prediction 1,
row 8 - prediction 1]
[row 2 - prediction 2,
row 5 - prediction 2,
row 1 - prediction 2,
row 7 - prediction 2]
[row 3 - prediction 3]
Without using pandas.
Upvotes: 0
Views: 147
Reputation: 1486
Conveniently, for a linear regression of the form

$$y = X\beta + \varepsilon$$

with the standard ordinary least squares assumptions, the estimated parameters have an analytical solution:

$$\hat{\beta} = (X^T X)^{-1} X^T y$$

X is your features, y is your label, and the superscripts T and -1 denote the matrix transpose and matrix inverse respectively.
You can write a pandas_udf to compute your linear regression parameters with the formula above and apply it after groupBy. Note that the standard udf which you are now using won't work with groupBy.
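As a rough illustration of the per-group fit described above, here is a minimal sketch in plain NumPy, outside Spark. The function names ols_coefficients and fit_per_group are hypothetical, and the added intercept column is an assumption rather than part of the original answer. The sketch uses np.linalg.lstsq instead of forming the inverse explicitly; it gives the same coefficients when X^T X is invertible, and still returns a (minimum-norm) answer for very small groups such as a bin with a single row.

```python
import numpy as np


def ols_coefficients(X, y, fit_intercept=True):
    """Least-squares estimate of beta in y = X beta + eps.

    Equivalent to (X^T X)^{-1} X^T y whenever X^T X is invertible.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if fit_intercept:
        # Assumption: prepend a column of ones so beta[0] is the intercept.
        X = np.column_stack([np.ones(len(X)), X])
    # lstsq solves the same least-squares problem without an explicit inverse.
    return np.linalg.lstsq(X, y, rcond=None)[0]


def fit_per_group(rows):
    """Fit one regression per group.

    rows: iterable of (group_key, feature_list, label) tuples,
    e.g. (prediction, features, trip_duration).
    Returns a dict mapping group_key -> coefficient array.
    """
    groups = {}
    for key, features, label in rows:
        xs, ys = groups.setdefault(key, ([], []))
        xs.append(features)
        ys.append(label)
    return {key: ols_coefficients(xs, ys) for key, (xs, ys) in groups.items()}


if __name__ == "__main__":
    # Made-up toy data: (bin, features, label).
    data = [
        (1, [1.0], 3.0), (1, [2.0], 5.0), (1, [3.0], 7.0),
        (2, [0.0], 1.0), (2, [1.0], 0.0),
    ]
    for key, beta in sorted(fit_per_group(data).items()):
        print(key, beta)
```

Inside Spark, the body of ols_coefficients could be called from a grouped-map pandas_udf applied after groupBy("prediction"). The features vector column would probably need to be expanded into plain numeric columns first, since that is an assumption about what the Arrow conversion accepts. The pickling error in the question most likely arises because the lambda captures vecAssembler, which wraps a JVM object that cannot be serialized and shipped to the workers.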
Upvotes: 1