Reputation: 649
I have written some simple code in PySpark on Azure Databricks (following this link: decision tree in pyspark):
%python
import pandas as pd
from pyspark.mllib.tree import DecisionTree

x = 'x'
z = 'y'
data = pd.DataFrame({'a': [1, 2, 3, 41, 2, 6, 2, 3, 56, 1, 2, 5, 1, 2, 45, 1, 3, 2], 'b': [x, z, x, x, z, x, z, x, x, x, z, z, x, z, z, x, x, x]})
# Train a DecisionTree model.
model = DecisionTree.trainClassifier(data, numClasses=2, categoricalFeaturesInfo={}, impurity='gini', maxDepth=5, maxBins=32)
I have kept the parameters at their defaults. When I run it, I get this error:
TypeError: first() missing 1 required positional argument: 'offset'
I am not sure which argument this error refers to. Also, where do I need to specify my dependent variable in the classifier?
Upvotes: 1
Views: 5003
Reputation: 1588
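trainClassifier expects its first parameter to be an RDD, but the data you passed is a pandas DataFrame. You see the error because trainClassifier calls first() on its input, expecting the Spark RDD method, which takes no arguments. On a pandas DataFrame, first() is a different method that requires an offset argument, hence the TypeError.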
Per the documentation, the training data must be an "RDD of LabeledPoint. Labels should take values {0, 1, …, numClasses-1}."
Hence, convert data to an RDD of LabeledPoint, with the dependent variable as each point's label, and that should work fine.
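A minimal sketch of that conversion, using the sample values from the question. The label encoding (x as 0.0, y as 1.0), the helper name train_tree, and the choice of column a as the only feature are assumptions made for illustration; the Spark imports sit inside the function so that the data preparation can be checked without a cluster.

```python
# Sample data from the question: 'a' is the feature, 'b' the dependent variable.
a = [1, 2, 3, 41, 2, 6, 2, 3, 56, 1, 2, 5, 1, 2, 45, 1, 3, 2]
b = ['x', 'y', 'x', 'x', 'y', 'x', 'y', 'x', 'x',
     'x', 'y', 'y', 'x', 'y', 'y', 'x', 'x', 'x']

# Assumed encoding: labels must be numeric values in {0, ..., numClasses-1}.
label_index = {'x': 0.0, 'y': 1.0}

# (label, [features]) pairs. With the pandas DataFrame from the question,
# zip(data['a'], data['b']) would yield the same (a, b) pairs.
rows = [(label_index[label], [float(feature)]) for feature, label in zip(a, b)]


def train_tree(sc, rows):
    """Train on an RDD of LabeledPoint; sc is an existing SparkContext."""
    # Imported here so the preparation above runs without Spark installed.
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.tree import DecisionTree

    # The dependent variable goes in as the label of each LabeledPoint.
    points = sc.parallelize(
        [LabeledPoint(label, features) for label, features in rows])
    return DecisionTree.trainClassifier(
        points, numClasses=2, categoricalFeaturesInfo={},
        impurity='gini', maxDepth=5, maxBins=32)
```

In a Databricks notebook, where sc is already defined, this would be called as model = train_tree(sc, rows).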
Upvotes: 2