Reputation: 406
Here is my dataset:
anger,happy food food
anger,dog food food
disgust,food happy food
disgust,food dog food
neutral,food food happy
neutral,food food dog
Second, this is my code, where I build a bag of words using the CountVectorizer class.
import pandas as pd
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.ml.feature import CountVectorizer

classValues = {'anger': '0',
               'disgust': '1',
               'fear': '2',
               'happiness': '3',
               'sadness': '4',
               'surprise': '5',
               'neutral': '6'}

def getClass(line):
    parts = line.split(',')
    return float(classValues[parts[0]])

def getTags(line):
    parts = line.split(',')
    return parts[1].split(" ")

conf = SparkConf()
conf.setAppName("NaiveBaye")
conf.set('spark.driver.memory', '6g')
conf.set('spark.executor.memory', '6g')
conf.set('spark.cores.max', 156)
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

data = sc.textFile('dataset.txt')
classes = data.map(getClass).collect()
tags = data.map(getTags).collect()

d = {
    'tags': tags,
    'classes': classes
}
df = sqlContext.createDataFrame(pd.DataFrame(data=d))

cv = CountVectorizer(inputCol="tags", outputCol="vectors")
model = cv.fit(df)
model.transform(df).show(truncate=False)

vocabulary = sorted(map(str, model.vocabulary))
print vocabulary
As you can see, model.transform(df).show(truncate=False) and print vocabulary work as expected:
+-------+-------------------+-------------------+
|classes|tags |vectors |
+-------+-------------------+-------------------+
|0.0 |[happy, food, food]|(3,[0,1],[2.0,1.0])|
|0.0 |[dog, food, food] |(3,[0,2],[2.0,1.0])|
|1.0 |[food, happy, food]|(3,[0,1],[2.0,1.0])|
|1.0 |[food, dog, food] |(3,[0,2],[2.0,1.0])|
|6.0 |[food, food, happy]|(3,[0,1],[2.0,1.0])|
|6.0 |[food, food, dog] |(3,[0,2],[2.0,1.0])|
+-------+-------------------+-------------------+
['dog', 'food', 'happy']
Now, if I later want to vectorize a new element using the same vocabulary, how can I do this in Python?
For example
anger, happy dog food
will be
|0.0 |[happy, dog, food]|(3,[0,1,2],[1.0,1.0,1.0])|
I've read in the documentation that there is a CountVectorizerModel class that allows loading an existing vocabulary, but I can't find any documentation about it.
This is very important to me, because to classify a new element I need the vector entries in the same order, so that I can reuse the same trained classifier model.
I've tried something like this:
CountVectorizerModel(vocabulary)
but it doesn't work.
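To make the ordering requirement concrete, here is a plain-Python sketch (not Spark) of how a fixed vocabulary maps tokens to the sparse (index, count) pairs shown above; the helper name to_sparse_counts is my own, for illustration only:

```python
# Plain-Python illustration (not Spark): a fixed vocabulary defines the
# index of each term, so reusing it keeps vectors comparable across runs.
def to_sparse_counts(tokens, vocabulary):
    index = {term: i for i, term in enumerate(vocabulary)}
    counts = {}
    for t in tokens:
        if t in index:  # out-of-vocabulary tokens are simply dropped
            counts[index[t]] = counts.get(index[t], 0) + 1.0
    return sorted(counts.items())  # [(index, count), ...]

vocabulary = ['food', 'happy', 'dog']  # Spark orders terms by corpus frequency
print(to_sparse_counts("happy dog food".split(" "), vocabulary))
# [(0, 1.0), (1, 1.0), (2, 1.0)]
```

If a new element were encoded with a differently ordered vocabulary, index 0 would no longer mean "food", which is exactly what would break the classifier.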
I'm currently using Spark 1.6.1
Upvotes: 2
Views: 2174
Reputation: 40360
Since Spark 2.0, it's available in pyspark, and persisting and loading the model works like with any other spark-ml model.
OK, let's first create a model:
from pyspark.ml.feature import CountVectorizer, CountVectorizerModel
# Input data: Each row is a bag of words with a ID.
df = spark.createDataFrame([
(0, "a b c".split(" ")),
(1, "a b b c a".split(" "))
], ["id", "words"])
# fit a CountVectorizerModel from the corpus.
cv = CountVectorizer(inputCol="words", outputCol="features", vocabSize=3, minDF=2.0)
model = cv.fit(df)
result = model.transform(df)
result.show(truncate=False)
# +---+---------------+-------------------------+
# |id |words |features |
# +---+---------------+-------------------------+
# |0 |[a, b, c] |(3,[0,1,2],[1.0,1.0,1.0])|
# |1 |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])|
# +---+---------------+-------------------------+
Then persist it:
model.save("/tmp/count_vec_model")
Now you can load it and use it:
same_model = CountVectorizerModel.load("/tmp/count_vec_model")
same_model.transform(df).show(truncate=False)
# +---+---------------+-------------------------+
# |id |words |features |
# +---+---------------+-------------------------+
# |0 |[a, b, c] |(3,[0,1,2],[1.0,1.0,1.0])|
# |1 |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])|
# +---+---------------+-------------------------+
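The point of persisting the model is that its vocabulary, and therefore the term-to-index mapping, survives across sessions. A minimal plain-Python analogue of that round trip (illustration only; Spark's actual on-disk format is Parquet data plus metadata, not JSON):

```python
import json

vocabulary = ['a', 'b', 'c']  # stands in for model.vocabulary from the fit above

# "Persist" and "reload" the vocabulary, mimicking what model.save/load
# guarantees: the same terms in the same order.
saved = json.dumps(vocabulary)
reloaded = json.loads(saved)

# Because the term-to-index mapping is identical, a brand-new document is
# encoded with the same indices as during training.
index = {term: i for i, term in enumerate(reloaded)}
doc = "c b a b a".split(" ")
vector = [0.0] * len(reloaded)
for term in doc:
    if term in index:  # unseen terms would be ignored
        vector[index[term]] += 1.0
print(vector)
# [2.0, 2.0, 1.0]
```

This is why loading the saved CountVectorizerModel, rather than refitting on new data, is the right way to vectorize new elements for an existing classifier.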
For more information, please refer to the documentation on saving and loading spark-ml models/pipelines.
The model-creation code example is taken from the official documentation.
Upvotes: 2