Mike Pone

Reputation: 19330

Converting CSV values to Vector in Spark Dataframe in Java

I have a CSV file with two columns

id, features

The id column is a string and the features column is a comma-delimited list of feature values for a machine learning algorithm, e.g. "[1,4,5]". I basically just need to call Vectors.parse() on each value to get a Vector, but I don't want to convert to an RDD first.
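
For reference, this is roughly what parsing a single value gives (a minimal sketch, assuming Spark 1.x and the MLlib linalg classes):

  import org.apache.spark.mllib.linalg.Vector;
  import org.apache.spark.mllib.linalg.Vectors;

  // "[1,4,5]" is the string stored in the features column
  Vector v = Vectors.parse("[1,4,5]");   // dense vector [1.0, 4.0, 5.0]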

I want to get this into a Spark DataFrame where the features column is an org.apache.spark.mllib.linalg.Vector.

I am reading this into a DataFrame with the Databricks spark-csv API, and I'm trying to convert the features column to a Vector.

Does anyone know how to do this in Java?

Upvotes: 0

Views: 596

Answers (1)

Mike Pone

Reputation: 19330

I found one way to do it with a UDF. Are there any other ways to do this?

  // sqlc is an org.apache.spark.sql.SQLContext; imports used below:
  // java.util.HashMap, org.apache.spark.mllib.linalg.Vector/Vectors/VectorUDT,
  // org.apache.spark.sql.DataFrame, org.apache.spark.sql.functions,
  // org.apache.spark.sql.api.java.UDF1, and org.apache.spark.sql.types.*

  HashMap<String, String> options = new HashMap<String, String>();
  options.put("header", "true");
  String input = args[0];

  // Register a UDF that parses the "[1,4,5]" string into an MLlib Vector
  sqlc.udf().register("toVector", new UDF1<String, Vector>() {
     @Override
     public Vector call(String t1) throws Exception {
        return Vectors.parse(t1);
     }
  }, new VectorUDT());

  // Read both columns as strings first
  StructField[] fields = {
     new StructField("id", DataTypes.StringType, false, Metadata.empty()),
     new StructField("features", DataTypes.StringType, false, Metadata.empty())};
  StructType schema = new StructType(fields);

  DataFrame df = sqlc.read().format("com.databricks.spark.csv")
        .schema(schema).options(options).load(input);

  // Replace the string column with the parsed Vector column
  df = df.withColumn("features", functions.callUDF("toVector", df.col("features")));
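
To sanity-check the conversion, printing the schema afterwards should show features as a vector type rather than a string (just a quick verification step, assuming the UDF registered correctly):

  df.printSchema();   // "features" should now be reported as a vector type instead of string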

Upvotes: 1
