Reputation: 73
I want to run an SVM regression, but I am having problems with the input format. Right now, my train and test set for one customer looks like this:
1 '12262064 |f offer_quantity:1
has_bought_brand_company:1 has_bought_brand_a:6.79 has_bought_brand_q_60:1.0
has_bought_brand:2.0 has_bought_company_a:1.95 has_bought_brand_180:1.0
has_bought_brand_q_180:1.0 total_spend:218.37 has_bought_brand_q:3.0 offer_value:1.5
has_bought_brand_a_60:2.79 has_bought_brand_60:1.0 has_bought_brand_q_90:1.0
has_bought_brand_a_90:2.79 has_bought_company_q:1.0 has_bought_brand_90:1.0
has_bought_company:1.0 never_bought_category:1 has_bought_brand_a_180:2.79
I tried to read this text file into Spark, but without success. What am I missing? Do I have to delete the feature names? Right now it's in Vowpal Wabbit format.
My code looks like this:
import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.util.MLUtils
// Load training data in LIBSVM format.
val data = MLUtils.loadLibSVMFile(sc, "mllib/data/train.txt")
// Split data into training (60%) and test (40%).
val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0).cache()
val test = splits(1)
// Run training algorithm to build the model.
val numIterations = 100
val model = SVMWithSGD.train(training, numIterations)
model.clearThreshold()
val scoreAndLabels = test.map { point =>
val score = model.predict(point.features)
(score, point.label)
}
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
val auROC = metrics.areaUnderROC()
println("Area under ROC = " + auROC)
I get an answer, but my AUC value is 1, which shouldn't be the case.
scala> println("Area under ROC = " + auROC)
Area under ROC = 1.0
Upvotes: 4
Views: 1761
Reputation: 2033
I think your file is not in LIBSVM format. Either convert the file to LIBSVM format, or load it as a plain text file and build the LabeledPoint objects yourself. This is what I did for my file:
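import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
// tweetInput is the path to my input file; numIterations is defined as in the question.
// Hash each line's terms into a fixed-size feature vector (2 buckets here).
val tf = new HashingTF(2)
val tweets = sc.textFile(tweetInput)
val labelPoint = tweets.map { l =>
  val parts = l.split(' ')
  // The first token is the label; the remaining tokens are hashed as word bigrams.
  // Bigrams are joined into strings because arrays are not hashed by content.
  val t = tf.transform(parts.tail.sliding(2).map(_.mkString(" ")).toSeq)
  LabeledPoint(parts(0).toDouble, t)
}.cache()
labelPoint.count()
val model = LinearRegressionWithSGD.train(labelPoint, numIterations)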
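If you would rather keep the numeric feature values from the question instead of hashing tokens, the first option (converting to LIBSVM) could look roughly like the sketch below. It is plain Scala with no Spark dependency. `VwToLibSvm`, `parse`, `buildIndex` and `toLibSvm` are hypothetical names of my own, not part of Spark or Vowpal Wabbit. It assumes each example sits on a single line of the form `label 'tag |namespace name:value ...` with exactly one namespace attached to the pipe, and that a feature without a value counts as 1.0. LIBSVM expects one-based feature indices in ascending order, so the feature names are mapped to indices first.

```scala
// Hypothetical helper (not a Spark API): converts Vowpal Wabbit lines to LIBSVM lines.
object VwToLibSvm {

  // Splits one VW line into its label and its (name, value) pairs.
  // Assumes a single namespace directly after the pipe, e.g. "|f".
  def parse(line: String): (Double, Seq[(String, Double)]) = {
    val Array(header, body) = line.split("\\|", 2)
    // The label is the first token before the pipe; the tag ('12262064) is ignored.
    val label = header.trim.split("\\s+").head.toDouble
    // Drop the namespace token, then read "name:value" or a bare "name".
    val features = body.trim.split("\\s+").toSeq.drop(1).map { tok =>
      tok.split(":", 2) match {
        case Array(name, value) => name -> value.toDouble
        case Array(name)        => name -> 1.0
      }
    }
    (label, features)
  }

  // Assigns a one-based index to every feature name seen in the data.
  def buildIndex(lines: Seq[String]): Map[String, Int] =
    lines.flatMap(l => parse(l)._2.map(_._1)).distinct.sorted
      .zipWithIndex.map { case (name, i) => name -> (i + 1) }.toMap

  // Renders one VW line as "label index:value ..." with ascending indices.
  def toLibSvm(line: String, index: Map[String, Int]): String = {
    val (label, features) = parse(line)
    val cols = features.map { case (name, v) => index(name) -> v }.sortBy(_._1)
    (label.toString +: cols.map { case (i, v) => s"$i:$v" }).mkString(" ")
  }
}
```

The converted lines can then be written to a file and loaded with MLUtils.loadLibSVMFile as in the question. Also check the labels: as far as I know, SVMWithSGD expects binary labels of 0 and 1, so a file labelled with -1 and 1 would need to be remapped first.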
Upvotes: 1