Reputation: 2630
Most MLlib training examples use data loaded from a libsvm file. In my case, however, I load the data from Hive and store it directly in a DataFrame.
How should I organize my DataFrame (for example, by separating the label from the features) so that a model can use it directly? I would rather not write the DataFrame back to a libsvm file for later training, but if that is necessary, I would appreciate guidance on how to do it.
Thanks a lot.
============ UPDATE ========
For example, I have a DataFrame like the one below:
feature1 | feature2 | feature3 | feature4 | target
1        | 1        | 2        | 1        | 1
2        | 1        | 3        | 5        | 1
1        | 2        | 1        | 1        | 0
......
I want to treat the last column as the target and the other columns as features, then pass the data to a decision tree like this:
model = DecisionTree.trainClassifier(df, numClasses=2, categoricalFeaturesInfo={},
                                     impurity='gini', maxDepth=5, maxBins=32)
How can I specify which columns of "df" are the features and which one is the target? In most examples, the data is loaded from a libsvm file like this:
data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
Upvotes: 0
Views: 131
Reputation: 1402
You can do that by creating an RDD of LabeledPoint objects, each of which pairs a label (your target) with a feature vector. Have a look at this example. In this example, parts(0) is your target variable while parts(1) contains a space-separated list of features.
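For the DataFrame in the question, a minimal PySpark sketch of this idea might look like the following. It assumes the column names from the question ("feature1" through "feature4" and "target"), that every column is numeric, and a Spark version where a DataFrame exposes .rdd. The helper names split_row and train_tree_from_dataframe are made up for illustration and are not part of MLlib.

```python
def split_row(row, feature_cols, target_col):
    """Split one row into (label, features), both as floats.

    Works on anything indexable by column name, such as a pyspark Row
    or a plain dict. Hypothetical helper, not an MLlib function.
    """
    label = float(row[target_col])
    features = [float(row[c]) for c in feature_cols]
    return label, features


def train_tree_from_dataframe(df, target_col="target"):
    """Convert a DataFrame to an RDD of LabeledPoint and train a tree.

    Assumes all columns are numeric and that every column other than
    target_col is a feature.
    """
    # Imported here so that split_row can be used without Spark installed.
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.tree import DecisionTree

    feature_cols = [c for c in df.columns if c != target_col]

    def to_labeled_point(row):
        label, features = split_row(row, feature_cols, target_col)
        return LabeledPoint(label, features)

    # DecisionTree.trainClassifier expects an RDD of LabeledPoint,
    # not a DataFrame, so convert row by row first.
    labeled = df.rdd.map(to_labeled_point)

    return DecisionTree.trainClassifier(
        labeled, numClasses=2, categoricalFeaturesInfo={},
        impurity='gini', maxDepth=5, maxBins=32)
```

With this in place, model = train_tree_from_dataframe(df) would replace the direct call on df. If you do still want a libsvm file, the same RDD of LabeledPoint can be passed to MLUtils.saveAsLibSVMFile, but that step should not be needed just for training.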
Upvotes: 0