Arpit Sisodia

Reputation: 649

TypeError: first() missing 1 required positional argument: 'offset' in DecisionTree.trainClassifier in Pyspark

I have written some simple PySpark code on Azure Databricks (following this link: decision tree in pyspark):

%python
import pandas as pd
from pyspark.mllib.tree import DecisionTree

x = 'x'
z = 'y'
data = pd.DataFrame({'a': [1,2,3,41,2,6,2,3,56,1,2,5,1,2,45,1,3,2], 'b': [x,z,x,x,z,x,z,x,x,x,z,z,x,z,z,x,x,x]})

# Train a DecisionTree model.
model = DecisionTree.trainClassifier(data, numClasses=2, categoricalFeaturesInfo={}, impurity='gini', maxDepth=5, maxBins=32)

I have kept the parameters at their defaults. When I run it, I get this error:

TypeError: first() missing 1 required positional argument: 'offset'

I am not sure which argument this error refers to. Also, where do I specify my dependent variable in the classifier?


Upvotes: 1

Views: 5003

Answers (1)

Jim Todd

Reputation: 1588

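trainClassifier expects its first argument to be a Spark RDD, but the data you passed is a pandas DataFrame. Internally, trainClassifier calls data.first() to inspect the first record. On an RDD, first() takes no arguments; on a pandas DataFrame, first() is a different method that requires an offset argument, which is why the TypeError mentions 'offset'.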

As per the documentation, the training data must be an RDD of LabeledPoint, and labels should take values {0, 1, …, numClasses-1}.

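Hence, convert data to an RDD of LabeledPoint, and that should work. The dependent variable goes in as the label of each LabeledPoint, so the strings in column b need to be encoded as 0 and 1 first.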
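A minimal sketch of that conversion, assuming column a is the only feature and column b is the label. The helper names to_rows and train_tree and the 'x' -> 0.0, 'y' -> 1.0 mapping are my own choices for illustration, not part of the Spark API; sc is the SparkContext that a Databricks notebook already provides.

```python
# Values copied from the question: 'a' is the feature, 'b' the label.
a = [1, 2, 3, 41, 2, 6, 2, 3, 56, 1, 2, 5, 1, 2, 45, 1, 3, 2]
b = ['x', 'y', 'x', 'x', 'y', 'x', 'y', 'x', 'x',
     'x', 'y', 'y', 'x', 'y', 'y', 'x', 'x', 'x']

# Assumption: any mapping onto {0, ..., numClasses-1} would do.
LABEL_CODES = {'x': 0.0, 'y': 1.0}


def to_rows(features, labels, codes=LABEL_CODES):
    """Pair each encoded label with its one-element feature vector."""
    return [(codes[lab], [float(f)]) for f, lab in zip(features, labels)]


def train_tree(sc, rows):
    """Turn the rows into an RDD of LabeledPoint and train on it."""
    # Imported here so the data preparation above runs without Spark.
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.tree import DecisionTree

    rdd = sc.parallelize(rows).map(lambda r: LabeledPoint(r[0], r[1]))
    return DecisionTree.trainClassifier(
        rdd, numClasses=2, categoricalFeaturesInfo={},
        impurity='gini', maxDepth=5, maxBins=32)


rows = to_rows(a, b)
# In a notebook where Spark is available:
# model = train_tree(sc, rows)
```

With the pandas DataFrame from the question, to_rows(data['a'], data['b']) should produce the same rows, since zip iterates over the column values.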

Upvotes: 2
