Reputation: 137
I am new to Apache Spark and was trying to convert data from a .csv file into LabeledPoint so I can use the MLlib package of Apache Spark. I tried the following code to get an RDD of LabeledPoint data, but it turns out it produces the LabeledPoint of the ML package. Now I want to create the correct LabeledPoint of the MLlib package. Could anyone please help?
private static String appName = "learning_RDD";
private static String master = "spark://23.195.26.187:7077" ;
static SparkConf sparkConf = new SparkConf().setMaster("local[1]").setAppName("MLPipelineSample").set("spark.driver.memory", "512m").set("spark.sql.warehouse.dir","D:\\input.txt");
static SparkContext sc = new SparkContext(sparkConf);
static SparkSession spark = SparkSession
.builder().sparkContext(sc)
.getOrCreate();
public static void main(String args[]) throws IOException {
Dataset<Row> trainingData = spark.read().format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("D:\\abc\\Spark\\WebcontentClassification_UsingSparkML\\WebcontentClassification_UsingSparkML\\NaiveBayes_ML_20ErrorRate\\nutchcsvalldata.csv");
Tokenizer tokenizer = new Tokenizer().setInputCol("content").setOutputCol("words");
Dataset<Row> words = tokenizer.transform(trainingData);
StopWordsRemover remover = new StopWordsRemover().setInputCol("words").setOutputCol("filteredwords");
Dataset<Row> filteredwords = remover.transform(words);
HashingTF hashingTF = new HashingTF().setNumFeatures(1000).setInputCol("filteredwords").setOutputCol("rawfeatures");
Dataset<Row> hashedtf_Vector = hashingTF.transform(filteredwords);
IDF idf = new IDF().setInputCol("rawfeatures").setOutputCol("features");
IDFModel idfModel = idf.fit(hashedtf_Vector);
Dataset<Row> Vectors = idfModel.transform(hashedtf_Vector);
Iterator<Row> iterator = Vectors.toLocalIterator();
List<LabeledPoint> labeledpoints = new ArrayList<LabeledPoint>();
while(iterator.hasNext())
{
Row r = iterator.next();
int label = r.getAs(2);
Vector v = r.getAs(16);
LabeledPoint labeledpoint = new LabeledPoint(label, v);
labeledpoints.add(labeledpoint);
}
// Here I am supposed to convert the List into RDD<LabeledPoint> and use the SVM algorithm
}
Upvotes: 1
Views: 2932
Reputation: 137
I found a solution (I am not posting the exact solution). To start with converting words into vectors, the following code could be used; the result can later be turned into LabeledPoint.
JavaRDD<String> lines = spark.read().textFile(Input_file_path).toJavaRDD();
JavaRDD<Iterable<String>> words_iterable = lines.map(new Function<String, Iterable<String>>() {
public Iterable<String> call(String s) throws Exception {
String[] words = s.split(" ");
Iterable<String> output = Arrays.asList(words);
return output;
}
});
Word2Vec word2vec = new Word2Vec();
Word2VecModel word2vecmodel = word2vec.fit(words_iterable);
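Once fitted, the MLlib Word2VecModel already returns MLlib vectors, so they can go straight into a LabeledPoint without any ML-to-MLlib conversion. A minimal sketch of the lookup; the word "spark" and the label 1.0 are just placeholders:

```java
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.regression.LabeledPoint;

// Look up the learned vector for a word (throws if the word was not in the training corpus)
Vector v = word2vecmodel.transform("spark");
// The mllib Word2Vec yields mllib vectors, so they plug directly into LabeledPoint
LabeledPoint point = new LabeledPoint(1.0, v);
```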
Upvotes: 0
Reputation: 191738
I want to create the correct LabeledPoint Data of MLlib Package
import org.apache.spark.mllib.regression.LabeledPoint;
I am supposed to convert the List into RDD and use
I think you need to map over Vectors and convert the result into the format you need.
I've been using Scala, but it might translate roughly like this
RDD<LabeledPoint> training = Vectors.map(r -> {
double label = (double) r.getAs(2); // labels should be doubles
Vector v = r.getAs(16); // maybe convert this to a dense / sparse array
return new LabeledPoint(label, v);
});
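In Java, a rough equivalent goes through toJavaRDD(). One extra step matters here: the ML-pipeline IDF in the question produces org.apache.spark.ml.linalg.Vector, while the MLlib LabeledPoint needs org.apache.spark.mllib.linalg.Vector, and Vectors.fromML(...) bridges the two. Because the question names its Dataset<Row> "Vectors", which clashes with the MLlib Vectors class, the sketch below renames it tfidfData (an assumed name):

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.LabeledPoint;

// tfidfData is the Dataset<Row> returned by idfModel.transform(...)
// (called "Vectors" in the question; renamed to avoid clashing with mllib's Vectors class)
JavaRDD<LabeledPoint> training = tfidfData.toJavaRDD().map(r -> {
    double label = ((Number) r.get(2)).doubleValue();        // labels should be doubles
    org.apache.spark.ml.linalg.Vector mlVec = r.getAs(16);   // ML-package vector from IDF
    return new LabeledPoint(label, Vectors.fromML(mlVec));   // bridge ML -> MLlib vector
});
```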
Upvotes: 1
Reputation: 56
Let's say you have four fields in each row of your csv file, of which the first field is your label and the remaining three fields are your features (assuming all are double values). You can create your LabeledPoint RDD as follows:
JavaSparkContext sc = new JavaSparkContext(sparkConf);
String path = "path/to/your/data.csv"; // a path to your csv file, not the data source name
JavaRDD<String> data = sc.textFile(path);
JavaRDD<LabeledPoint> parsedData = data
.map(new Function<String, LabeledPoint>() {
public LabeledPoint call(String line) throws Exception {
String[] parts = line.split(",");
return new LabeledPoint(Double.parseDouble(parts[0]),
Vectors.dense(Double.parseDouble(parts[1]),
Double.parseDouble(parts[2]),
Double.parseDouble(parts[3])));
}
});
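Since the question ultimately wants to run SVM, the resulting RDD can be fed to MLlib's SVMWithSGD. A minimal sketch continuing from parsedData above; the iteration count of 100 is an assumed setting to tune for your data:

```java
import org.apache.spark.mllib.classification.SVMModel;
import org.apache.spark.mllib.classification.SVMWithSGD;
import org.apache.spark.mllib.linalg.Vectors;

parsedData.cache();      // SVMWithSGD makes multiple passes over the data, so cache it
int numIterations = 100; // assumed; tune for your data
SVMModel model = SVMWithSGD.train(parsedData.rdd(), numIterations);
// Predict the label of a new point with the same three features
double prediction = model.predict(Vectors.dense(1.0, 2.0, 3.0));
```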
Upvotes: 3