How to correctly create TF-IDF vectors of sentences in Apache Spark with Java?

Question

I have this code:

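import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.feature.HashingTF;
import org.apache.spark.mllib.feature.IDF;
import org.apache.spark.mllib.feature.IDFModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.sql.SparkSession;

public class TfIdfExample {
    public static void main(String[] args) {
        // SparkSingleton is my own helper class (not shown here)
        JavaSparkContext sc = SparkSingleton.getContext();
        SparkSession spark = SparkSession.builder()
                .config("spark.sql.warehouse.dir", "spark-warehouse")
                .getOrCreate();
        // Each document is a list of tokens
        JavaRDD<List<String>> documents = sc.parallelize(Arrays.asList(
                Arrays.asList("this is a sentence".split(" ")),
                Arrays.asList("this is another sentence".split(" ")),
                Arrays.asList("this is still a sentence".split(" "))), 2);

        HashingTF hashingTF = new HashingTF();
        documents.cache();
        JavaRDD<Vector> featurizedData = hashingTF.transform(documents);
        // alternatively, CountVectorizer can also be used to get term frequency vectors

        // Cache before fitting so that fit and transform reuse the same TF vectors
        featurizedData.cache();
        IDF idf = new IDF();
        IDFModel idfModel = idf.fit(featurizedData);

        JavaRDD<Vector> tfidfs = idfModel.transform(featurizedData);
        System.out.println(tfidfs.collect());
        // KMeansProcessor is my own class (not shown here); the type parameters
        // are inferred from the (vector, cluster id) pairs it prints
        KMeansProcessor kMeansProcessor = new KMeansProcessor();
        JavaPairRDD<Vector, Integer> result = kMeansProcessor.Process(tfidfs);
        result.collect().forEach(System.out::println);
    }
}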

I need to get vectors for k-means, but I am getting odd vectors:

[(1048576,[489554,540177,736740,894973],[0.28768207245178085,0.0,0.0,0.0]),
 (1048576,[455491,540177,736740,894973],[0.6931471805599453,0.0,0.0,0.0]),
 (1048576,[489554,540177,560488,736740,894973],[0.28768207245178085,0.0,0.6931471805599453,0.0,0.0])]

After k-means runs, I get this:

((1048576,[489554,540177,736740,894973],[0.28768207245178085,0.0,0.0,0.0]),1)
((1048576,[489554,540177,736740,894973],[0.28768207245178085,0.0,0.0,0.0]),0)
((1048576,[489554,540177,736740,894973],[0.28768207245178085,0.0,0.0,0.0]),1)
((1048576,[455491,540177,736740,894973],[0.6931471805599453,0.0,0.0,0.0]),1)
((1048576,[489554,540177,560488,736740,894973],[0.28768207245178085,0.0,0.6931471805599453,0.0,0.0]),1)
((1048576,[455491,540177,736740,894973],[0.6931471805599453,0.0,0.0,0.0]),0)
((1048576,[455491,540177,736740,894973],[0.6931471805599453,0.0,0.0,0.0]),1)
((1048576,[489554,540177,560488,736740,894973],[0.28768207245178085,0.0,0.6931471805599453,0.0,0.0]),0)
((1048576,[489554,540177,560488,736740,894973],[0.28768207245178085,0.0,0.6931471805599453,0.0,0.0]),1)

But I think it is not working correctly, because TF-IDF vectors should look different. I think mllib has ready-made methods for this, but I tried the documentation examples and did not get what I need. I have not found a custom solution for Spark. Maybe somebody has worked with this and can tell me what I am doing wrong? Maybe I am not using the mllib functionality correctly?

Alexey Svyatkovskiy · Accepted Answer

What you are getting after TF-IDF is a SparseVector.

To understand the values better, let me start with TF vectors:

(1048576,[489554,540177,736740,894973],[1.0,1.0,1.0,1.0])
(1048576,[455491,540177,736740,894973],[1.0,1.0,1.0,1.0])
(1048576,[489554,540177,560488,736740,894973],[1.0,1.0,1.0,1.0,1.0])

For instance, the TF vector corresponding to the first sentence is a 1048576-component (= 2^20) vector with 4 non-zero values at indices 489554, 540177, 736740 and 894973. All other values are zero and are therefore not stored in the sparse vector representation.
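To make the (size, [indices], [values]) notation concrete, here is a minimal plain-Java sketch of how such a sparse vector can be read. It does not use Spark; the class name SparseVectorSketch and the helper valueAt are made up for illustration and only mimic the idea behind SparseVector.

```java
import java.util.Arrays;

public class SparseVectorSketch {
    // Returns the value at position i, or 0.0 when i is not among the stored indices.
    // Assumes the index array is sorted in ascending order, as in the printed vectors.
    public static double valueAt(int[] indices, double[] values, int i) {
        int pos = Arrays.binarySearch(indices, i);
        return pos >= 0 ? values[pos] : 0.0;
    }

    public static void main(String[] args) {
        int size = 1 << 20; // 1048576 components in total
        int[] indices = {489554, 540177, 736740, 894973};
        double[] values = {1.0, 1.0, 1.0, 1.0};

        System.out.println(valueAt(indices, values, 489554)); // stored entry: 1.0
        System.out.println(valueAt(indices, values, 12345));  // implicit zero: 0.0
        System.out.println((size - indices.length) + " components are not stored at all");
    }
}
```

Only the 4 stored entries take up memory; the remaining components are implied by the size field.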

The dimensionality of the feature vectors is equal to the number of buckets you hash into: 1048576 = 2^20 buckets in your case.

For a corpus of this size, you should consider reducing the number of buckets:

HashingTF hashingTF = new HashingTF(32);

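Powers of 2 are recommended for the number of buckets: the hash value is reduced to a bucket index with a modulo, and a power of 2 keeps the terms spread evenly over the buckets, which helps limit hash collisions.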
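The hashing step itself can be sketched in plain Java. This is only an illustration of the idea: it uses String.hashCode as a stand-in hash, which is an assumption, so the bucket indices will not match the ones Spark's HashingTF prints. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class HashingTfSketch {
    // Maps a term to a bucket in [0, numFeatures). floorMod keeps the result
    // non-negative even when hashCode() is negative.
    public static int bucket(String term, int numFeatures) {
        return Math.floorMod(term.hashCode(), numFeatures);
    }

    // Counts how many terms of a document land in each bucket.
    // Only non-empty buckets are kept, which is the sparse representation.
    public static Map<Integer, Double> termFrequencies(List<String> doc, int numFeatures) {
        Map<Integer, Double> tf = new TreeMap<>();
        for (String term : doc) {
            tf.merge(bucket(term, numFeatures), 1.0, Double::sum);
        }
        return tf;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("this", "is", "a", "sentence");
        System.out.println(termFrequencies(doc, 32));      // 32 buckets
        System.out.println(termFrequencies(doc, 1 << 20)); // 2^20 buckets
    }
}
```

With fewer buckets the vectors are much shorter, at the cost of a higher chance that two different terms share a bucket and have their counts added together.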

Next, you apply IDF weights:

(1048576,[489554,540177,736740,894973],[0.28768207245178085,0.0,0.0,0.0])
(1048576,[455491,540177,736740,894973],[0.6931471805599453,0.0,0.0,0.0])
(1048576,[489554,540177,560488,736740,894973],[0.28768207245178085,0.0,0.6931471805599453,0.0,0.0])

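If we look at the first sentence again, we get 3 zeros. This is expected: the terms "this", "is", and "sentence" appear in every document of the corpus, so by the definition of IDF their weights are equal to zero. The remaining term, "a", appears in only two of the three documents and therefore keeps a non-zero weight.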
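The numbers can be reproduced by hand. The mllib documentation defines the smoothed IDF as ln((m + 1) / (df + 1)), where m is the number of documents and df is the number of documents containing the term. A small plain-Java sketch (the class name IdfSketch is hypothetical, and no Spark is involved) for this 3-document corpus:

```java
public class IdfSketch {
    // Smoothed IDF: ln((numDocs + 1) / (docFreq + 1)).
    public static double idf(long numDocs, long docFreq) {
        return Math.log((numDocs + 1.0) / (docFreq + 1.0));
    }

    public static void main(String[] args) {
        long m = 3; // three sentences in the corpus

        // "this", "is", "sentence" occur in all 3 documents: ln(4/4) = 0
        System.out.println("df=3: " + idf(m, 3));
        // "a" occurs in 2 documents: ln(4/3), roughly 0.2877
        System.out.println("df=2: " + idf(m, 2));
        // "another" and "still" occur in 1 document each: ln(4/2), roughly 0.6931
        System.out.println("df=1: " + idf(m, 1));
    }
}
```

Since every term frequency is 1.0 here, the TF-IDF values in the output are exactly these three IDF weights.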

Why are the zero values still in the (sparse) vector? Because in the current implementation the size of the vector is kept the same and only the values are multiplied by IDF.
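As a rough plain-Java illustration of that behaviour (not Spark's actual code; the class and method names are hypothetical), scaling only the stored values leaves the index array untouched, so a weight of 0.0 produces an explicit zero entry. Such entries can be filtered out afterwards if needed; as far as I know, Spark's Vector.toSparse() does this kind of cleanup, but check the API documentation for your version.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class ExplicitZerosSketch {
    // Multiplies each stored TF value by the IDF weight of its term.
    // The indices are not touched, so a zero weight leaves an explicit zero.
    public static double[] scale(double[] tfValues, double[] idfWeights) {
        double[] out = new double[tfValues.length];
        for (int i = 0; i < tfValues.length; i++) {
            out[i] = tfValues[i] * idfWeights[i];
        }
        return out;
    }

    // Returns only those indices whose stored value is non-zero.
    public static int[] nonZeroIndices(int[] indices, double[] values) {
        return IntStream.range(0, indices.length)
                .filter(i -> values[i] != 0.0)
                .map(i -> indices[i])
                .toArray();
    }

    public static void main(String[] args) {
        int[] indices = {489554, 540177, 736740, 894973};
        double[] tf = {1.0, 1.0, 1.0, 1.0};
        double[] idf = {Math.log(4.0 / 3.0), 0.0, 0.0, 0.0};

        double[] tfidf = scale(tf, idf);
        System.out.println(Arrays.toString(tfidf));                          // three explicit zeros remain
        System.out.println(Arrays.toString(nonZeroIndices(indices, tfidf))); // [489554]
    }
}
```

The explicit zeros do not change any distance between vectors, so they are harmless as input to k-means.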
