linpingta

Reputation: 2630

MLlib: DataFrame format for model training

Most examples for MLlib training use data loaded from a libsvm file. In my case, however, I load the data from Hive and keep it in a DataFrame.

How should I organize my DataFrame (for example, how do I designate the label part and the feature part) so that a model can use it directly? I would rather not write the DataFrame back to a libsvm file for future training, but if that is necessary, I would also appreciate pointers on how to do it.

Thanks a lot.

============ UPDATE ========

For example, I have a DataFrame like the one below:


feature1 | feature2 | feature3 | feature4 | target
1        | 1        | 2        | 1        | 1
2        | 1        | 3        | 5        | 1
1        | 2        | 1        | 1        | 0
...

I want to treat the last column as the target and the other columns as features, then pass it to a decision tree like this:

model = DecisionTree.trainClassifier(df, numClasses=2, categoricalFeaturesInfo={},
                                 impurity='gini', maxDepth=5, maxBins=32)

How can I specify which columns of "df" are features and which one is the target? In most examples, the data is loaded from a libsvm file instead:

data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')

Upvotes: 0

Views: 131

Answers (1)

asymptote

Reputation: 1402

You can do that by creating an RDD of LabeledPoint objects. Have a look at this example. In it, parts(0) is the target variable, while parts(1) contains a space-separated list of features.
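Applied to the DataFrame in the question, the same idea could look roughly like the sketch below. It assumes the label column is named "target" and that every other column is a numeric feature; the helper names to_label_and_features and train_tree are made up for illustration and are not part of MLlib.

```python
def to_label_and_features(row, feature_cols, label_col="target"):
    """Split one row into (label, feature list), casting every value to float.

    `row` can be a pyspark.sql.Row or a plain dict; both support row[name].
    """
    label = float(row[label_col])
    features = [float(row[c]) for c in feature_cols]
    return label, features


def train_tree(df, label_col="target"):
    """Hypothetical helper: train a decision tree directly from a DataFrame."""
    # Spark imports live here so the pure-Python helper above can be
    # tried without a Spark installation.
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.tree import DecisionTree

    # Assumption: every column except the label is a numeric feature.
    feature_cols = [c for c in df.columns if c != label_col]

    def to_point(row):
        label, features = to_label_and_features(row, feature_cols, label_col)
        return LabeledPoint(label, features)

    # df.rdd yields Row objects; map each one to a LabeledPoint.
    points = df.rdd.map(to_point)

    return DecisionTree.trainClassifier(
        points, numClasses=2, categoricalFeaturesInfo={},
        impurity='gini', maxDepth=5, maxBins=32)
```

With this in place, model = train_tree(df) stands in for the direct DecisionTree.trainClassifier(df, ...) call from the question. If a libsvm copy is still wanted, MLUtils.saveAsLibSVMFile from pyspark.mllib.util can write an RDD of LabeledPoint objects to a directory, but it is not required for training.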

Upvotes: 0
