Reputation: 19330
I have a CSV file with two columns
id, features
The id column is a string and the features column is a comma-delimited list of feature values for a machine learning algorithm, e.g. "[1,4,5]". I basically just need to call Vectors.parse() on each value to get a vector, but I don't want to convert to an RDD first.
I want to get this into a Spark DataFrame where the features column is an org.apache.spark.mllib.linalg.Vector.
I am reading the file into a DataFrame with the Databricks CSV API and trying to convert the features column to a Vector.
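Roughly what I have so far (a minimal sketch; the SQLContext variable sqlc and the file path are placeholders):
DataFrame df = sqlc.read()
        .format("com.databricks.spark.csv")
        .option("header", "true")
        .load("features.csv");   // both columns come back as strings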
Does anyone know how to do this in Java?
Upvotes: 0
Views: 596
Reputation: 19330
I found one way to do it with a UDF. Are there any other ways to do this?
import java.util.HashMap;
import org.apache.spark.mllib.linalg.*;    // Vector, Vectors, VectorUDT
import org.apache.spark.sql.*;             // DataFrame, functions
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.*;       // StructType, StructField, DataTypes, Metadata

HashMap<String, String> options = new HashMap<String, String>();
options.put("header", "true");
String input = args[0];

// Register a UDF that parses the string form "[1,4,5]" into an MLlib Vector (sqlc is an existing SQLContext)
sqlc.udf().register("toVector", new UDF1<String, Vector>() {
    @Override
    public Vector call(String t1) throws Exception {
        return Vectors.parse(t1);
    }
}, new VectorUDT());

// Read both columns as strings, then replace "features" with the parsed Vector
StructField[] fields = {
        new StructField("id", DataTypes.StringType, false, Metadata.empty()),
        new StructField("features", DataTypes.StringType, false, Metadata.empty())
};
StructType schema = new StructType(fields);
DataFrame df = sqlc.read().format("com.databricks.spark.csv").schema(schema).options(options).load(input);
df = df.withColumn("features", functions.callUDF("toVector", df.col("features")));
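If you want to verify the conversion, something like the following should show the features column with the vector UDT instead of string:
// Quick check on the resulting schema
df.printSchema();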
Upvotes: 1