Steve

CountVectorizer: reusing the same vocabulary a second time

Here is my dataset:

anger,happy food food
anger,dog food food
disgust,food happy food
disgust,food dog food
neutral,food food happy
neutral,food food dog

Next, this is my code, which builds a bag of words with the CountVectorizer class.

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.ml.feature import CountVectorizer
import pandas as pd

classValues = {'anger':     '0',
               'disgust':   '1',
               'fear':      '2',
               'happiness': '3',
               'sadness':   '4',
               'surprise':  '5',
               'neutral':   '6'}

def getClass(line):
    parts = line.split(',')
    return float(classValues[parts[0]])

def getTags(line):
    parts = line.split(',')
    return parts[1].split(" ")

conf= SparkConf()
conf.setAppName("NaiveBaye")
conf.set('spark.driver.memory','6g')
conf.set('spark.executor.memory','6g')
conf.set('spark.cores.max',156)

sc = SparkContext(conf= conf)
sqlContext = SQLContext(sc)

data = sc.textFile('dataset.txt')

classes = data.map(getClass).collect()
tags = data.map(getTags).collect()

d = {
    'tags' : tags,
    'classes' : classes
}

df = sqlContext.createDataFrame(pd.DataFrame(data=d))
cv = CountVectorizer(inputCol="tags", outputCol="vectors")
model = cv.fit(df)
model.transform(df).show(truncate=False)

# Sorting here is only for display; the index order used in the vectors
# is the order of model.vocabulary itself, not the sorted list.
vocabulary = sorted(map(str, model.vocabulary))
print(vocabulary)

As you can see, model.transform(df).show(truncate=False) and print(vocabulary) work as expected:

+-------+-------------------+-------------------+
|classes|tags               |vectors            |
+-------+-------------------+-------------------+
|0.0    |[happy, food, food]|(3,[0,1],[2.0,1.0])|
|0.0    |[dog, food, food]  |(3,[0,2],[2.0,1.0])|
|1.0    |[food, happy, food]|(3,[0,1],[2.0,1.0])|
|1.0    |[food, dog, food]  |(3,[0,2],[2.0,1.0])|
|6.0    |[food, food, happy]|(3,[0,1],[2.0,1.0])|
|6.0    |[food, food, dog]  |(3,[0,2],[2.0,1.0])|
+-------+-------------------+-------------------+
['dog', 'food', 'happy']
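As a side note on the notation: `(3,[0,1],[2.0,1.0])` is Spark's sparse vector format, meaning (size, indices, values). A small plain-Python sketch (the `to_dense` helper is just for illustration, not part of Spark) of what the first row expands to:

```python
def to_dense(size, indices, values):
    """Expand Spark's sparse vector notation (size, [indices], [values])."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

# First row: (3,[0,1],[2.0,1.0]) with vocabulary index 0 = "food", 1 = "happy"
print(to_dense(3, [0, 1], [2.0, 1.0]))  # [2.0, 1.0, 0.0] -> 2x food, 1x happy
```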

Now, if I later want to vectorize a new element using the same vocabulary, how can I do this in Python?

For example

anger, happy dog food

will be

|0.0    |[happy, dog, food]|(3,[0,1,2],[1.0,1.0,1.0])|

I've read in the documentation that there is a CountVectorizerModel that allows loading an existing vocabulary, but I can't find any documentation on how to use it.

This is very important to me: to classify a new element, the vector indices must be in the same order, so that I can reuse the same classifier model.

I've tried something like this:

CountVectorizerModel(vocabulary)

but it doesn't work.
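For context, this is what a fixed-vocabulary transform has to do conceptually, sketched in plain Python (the `vectorize` helper is hypothetical, not a Spark API; note that the vocabulary must be in `model.vocabulary` order, not sorted):

```python
from collections import Counter

def vectorize(tokens, vocabulary):
    """Count tokens against a fixed vocabulary; unknown tokens are dropped."""
    index = {term: i for i, term in enumerate(vocabulary)}
    counts = Counter(tokens)
    pairs = sorted((index[t], float(c)) for t, c in counts.items() if t in index)
    indices = [i for i, _ in pairs]
    values = [v for _, v in pairs]
    return (len(vocabulary), indices, values)

# Vocabulary in the order learned by the fitted model: food=0, happy=1, dog=2
vocab = ["food", "happy", "dog"]
print(vectorize("happy dog food".split(" "), vocab))
# (3, [0, 1, 2], [1.0, 1.0, 1.0])
```

Any real solution has to reproduce this index mapping, which is why the vocabulary order must be preserved between fitting and later transforms.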

Edit 1

I'm currently using Spark 1.6.1

Upvotes: 2


Answers (1)

eliasah

As of Spark 2.0, this is available in PySpark, and it works like persisting and loading other spark-ml models.

OK, let's first create a model:

from pyspark.ml.feature import CountVectorizer, CountVectorizerModel

# Input data: each row is a bag of words with an ID.
df = spark.createDataFrame([
    (0, "a b c".split(" ")),
    (1, "a b b c a".split(" "))
], ["id", "words"])

# fit a CountVectorizerModel from the corpus.
cv = CountVectorizer(inputCol="words", outputCol="features", vocabSize=3, minDF=2.0)

model = cv.fit(df)

result = model.transform(df)
result.show(truncate=False)
# +---+---------------+-------------------------+
# |id |words          |features                 |
# +---+---------------+-------------------------+
# |0  |[a, b, c]      |(3,[0,1,2],[1.0,1.0,1.0])|
# |1  |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])|
# +---+---------------+-------------------------+

Then persist it:

model.save("/tmp/count_vec_model")

Now you can load it and use it:

same_model = CountVectorizerModel.load("/tmp/count_vec_model")
same_model.transform(df).show(truncate=False)
# +---+---------------+-------------------------+
# |id |words          |features                 |
# +---+---------------+-------------------------+
# |0  |[a, b, c]      |(3,[0,1,2],[1.0,1.0,1.0])|
# |1  |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])|
# +---+---------------+-------------------------+

For more information, refer to the documentation on saving and loading spark-ml models/pipelines.

The model creation code example is available in the official documentation.

Upvotes: 2
