Mixalis Navridis

Reputation: 329

Spark 2.2.0: How to handle empty vectors (VectorUDT) produced by CountVectorizer on a Dataset

Hello community, I am a new member here and I would like to ask a simple question. I am trying to convert text documents into vectors using a given vocabulary, just like in the Spark examples. My problem is that in some cases I get empty vectors. It is very important for me to handle them, but I can't figure out how.

Here is my code:

List<Row> data = Arrays.asList(
        RowFactory.create(Arrays.asList("zero", "zero", "zero")),
        RowFactory.create(Arrays.asList("a", "b", "c")),
        RowFactory.create(Arrays.asList("a", "b", "b", "c", "a"))
);
StructType schema = new StructType(new StructField[]{
        new StructField("text", new ArrayType(DataTypes.StringType, true), false, Metadata.empty())
});
Dataset<Row> df = spark.createDataFrame(data, schema);

// define CountVectorizerModel with an a-priori vocabulary
CountVectorizerModel cvm = new CountVectorizerModel(new String[]{"a", "b", "c"})
        .setInputCol("text")
        .setOutputCol("feature");

cvm.transform(df).show(false);

Here is the output. I would like to handle or delete the first row, or, alternatively, is there an option to instantiate the empty vectors as vectors of 0.0?

+------------------+-------------------------+
|text              |feature                  |
+------------------+-------------------------+
|[zero, zero, zero]|(3,[],[])                |
|[a, b, c]         |(3,[0,1,2],[1.0,1.0,1.0])|
|[a, b, b, c, a]   |(3,[0,1,2],[2.0,2.0,1.0])|
+------------------+-------------------------+

I would appreciate it if someone could help me do this in Java.

Upvotes: 1

Views: 801

Answers (1)

Alper t. Turker

Reputation: 35229

To

instantiate empty vectors to 0.0

you don't have to do anything. (3,[],[]) is not empty: it is the SparseVector representation of the DenseVector [0.0, 0.0, 0.0] (size 3, no active indices, no active values).
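You can verify the equivalence yourself with Spark's Vectors factory methods; a minimal sketch (the class name is just for illustration, and only spark-mllib-local is needed, no SparkSession):

```java
import java.util.Arrays;

import org.apache.spark.ml.linalg.DenseVector;
import org.apache.spark.ml.linalg.SparseVector;
import org.apache.spark.ml.linalg.Vectors;

public class SparseIsNotEmpty {
    public static void main(String[] args) {
        // (3,[],[]) -- a size-3 sparse vector with no active entries
        SparseVector sparse = (SparseVector) Vectors.sparse(3, new int[0], new double[0]);

        // converting to dense materializes the implicit zeros
        DenseVector dense = sparse.toDense();
        System.out.println(Arrays.toString(dense.toArray()));  // [0.0, 0.0, 0.0]
    }
}
```

So the first row already carries the value you asked for; it is only printed in sparse notation.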

To delete:

You can create a UDF:

import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.ml.linalg.Vector;

UDF1<Vector, Boolean> isEmpty = new UDF1<Vector, Boolean>() {
    @Override
    public Boolean call(Vector vector) throws Exception {
        // a vector is "empty" when its sparse form has no active entries
        return vector.toSparse().numActives() == 0;
    }
};

spark.udf().register("isEmpty", isEmpty, DataTypes.BooleanType);

and use it with SQLTransformer:

SQLTransformer sqlTrans = new SQLTransformer().setStatement(
  "SELECT * FROM __THIS__ WHERE NOT isEmpty(feature)");

but please don't: an "empty" vector is itself a source of valuable information.
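For reference, the predicate inside the UDF can be exercised on its own, without a SparkSession, since the Vector classes ship in spark-mllib-local (the class name and sample values below are made up for illustration):

```java
import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.ml.linalg.Vectors;

public class IsEmptyCheck {
    // the same check the registered UDF applies to each row's feature vector
    static boolean isEmpty(Vector vector) {
        return vector.toSparse().numActives() == 0;
    }

    public static void main(String[] args) {
        Vector zero = Vectors.sparse(3, new int[0], new double[0]);  // like the [zero, zero, zero] row
        Vector nonZero = Vectors.dense(1.0, 1.0, 1.0);               // like the [a, b, c] row

        System.out.println(isEmpty(zero));     // true
        System.out.println(isEmpty(nonZero));  // false
    }
}
```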

Upvotes: 1
