Reputation: 406
Here is my dataset:
anger,happy food food
anger,dog food food
disgust,food happy food
disgust,food dog food
neutral,food food happy
neutral,food food dog
Second, this is my code, where I build a bag of words using the CountVectorizer class.
import pandas as pd
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.ml.feature import CountVectorizer

classValues = {'anger': '0',
               'disgust': '1',
               'fear': '2',
               'happiness': '3',
               'sadness': '4',
               'surprise': '5',
               'neutral': '6'}

def getClass(line):
    parts = line.split(',')
    return float(classValues[parts[0]])

def getTags(line):
    parts = line.split(',')
    return parts[1].split(" ")

conf = SparkConf()
conf.setAppName("NaiveBaye")
conf.set('spark.driver.memory', '6g')
conf.set('spark.executor.memory', '6g')
conf.set('spark.cores.max', 156)
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

data = sc.textFile('dataset.txt')
classes = data.map(getClass).collect()
tags = data.map(getTags).collect()

d = {
    'tags': tags,
    'classes': classes
}
df = sqlContext.createDataFrame(pd.DataFrame(data=d))

cv = CountVectorizer(inputCol="tags", outputCol="vectors")
model = cv.fit(df)
model.transform(df).show(truncate=False)

vocabulary = sorted(map(str, model.vocabulary))
print vocabulary
As you can see, model.transform(df).show(truncate=False) and print vocabulary work as expected:
+-------+-------------------+-------------------+
|classes|tags |vectors |
+-------+-------------------+-------------------+
|0.0 |[happy, food, food]|(3,[0,1],[2.0,1.0])|
|0.0 |[dog, food, food] |(3,[0,2],[2.0,1.0])|
|1.0 |[food, happy, food]|(3,[0,1],[2.0,1.0])|
|1.0 |[food, dog, food] |(3,[0,2],[2.0,1.0])|
|6.0 |[food, food, happy]|(3,[0,1],[2.0,1.0])|
|6.0 |[food, food, dog] |(3,[0,2],[2.0,1.0])|
+-------+-------------------+-------------------+
['dog', 'food', 'happy']
Now, if I later want to vectorize a new element using the same vocabulary, how can I do this in Python?
For example
anger, happy dog food
will be
|0.0 |[happy, dog, food]|(3,[0,1,2],[1.0,1.0,1.0])|
I've read in the documentation that there is a CountVectorizerModel class that allows loading an existing vocabulary, but I can't find any documentation about it.
This is very important to me, because to classify a new element I need the vector entries in the same order, so that I can reuse the same trained classifier model.
I've tried something like this:
CountVectorizerModel(vocabulary)
but it doesn't work.
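To make the ordering requirement concrete, here is a plain-Python sketch (not Spark) of how a fixed vocabulary maps tokens to the sparse (index, count) pairs shown above; the helper name to_sparse_counts is my own, for illustration only:

```python
# Plain-Python illustration (not Spark): a fixed vocabulary defines the
# index of each term, so reusing it keeps vectors comparable across runs.
def to_sparse_counts(tokens, vocabulary):
    index = {term: i for i, term in enumerate(vocabulary)}
    counts = {}
    for t in tokens:
        if t in index:  # out-of-vocabulary tokens are simply dropped
            counts[index[t]] = counts.get(index[t], 0) + 1.0
    return sorted(counts.items())  # [(index, count), ...]

vocabulary = ['food', 'happy', 'dog']  # Spark orders terms by corpus frequency
print(to_sparse_counts("happy dog food".split(" "), vocabulary))
# [(0, 1.0), (1, 1.0), (2, 1.0)]
```

If a new element were encoded with a differently ordered vocabulary, index 0 would no longer mean "food", which is exactly what would break the classifier.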
I'm currently using Spark 1.6.1
Upvotes: 2
Views: 2174
Reputation: 40360
Since Spark 2.0, it's available in pyspark, and persisting and loading the model works like with any other spark-ml model.
OK, let's first create a model:
from pyspark.ml.feature import CountVectorizer, CountVectorizerModel
# Input data: Each row is a bag of words with a ID.
df = spark.createDataFrame([
(0, "a b c".split(" ")),
(1, "a b b c a".split(" "))
], ["id", "words"])
# fit a CountVectorizerModel from the corpus.
cv = CountVectorizer(inputCol="words", outputCol="features", vocabSize=3, minDF=2.0)
model = cv.fit(df)
result = model.transform(df)
result.show(truncate=False)
# +---+---------------+-------------------------+
# |id |words |features |
# +---+---------------+-------------------------+
# |0 |[a, b, c] |(3,[0,1,2],[1.0,1.0,1.0])|
# |1 |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])|
# +---+---------------+-------------------------+
Then persist it:
model.save("/tmp/count_vec_model")
Now you can load it and use it:
same_model = CountVectorizerModel.load("/tmp/count_vec_model")
same_model.transform(df).show(truncate=False)
# +---+---------------+-------------------------+
# |id |words |features |
# +---+---------------+-------------------------+
# |0 |[a, b, c] |(3,[0,1,2],[1.0,1.0,1.0])|
# |1 |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])|
# +---+---------------+-------------------------+
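The point of persisting the model is that its vocabulary, and therefore the term-to-index mapping, survives across sessions. A minimal plain-Python analogue of that round trip (illustration only; Spark's actual on-disk format is Parquet data plus metadata, not JSON):

```python
import json

vocabulary = ['a', 'b', 'c']  # stands in for model.vocabulary from the fit above

# "Persist" and "reload" the vocabulary, mimicking what model.save/load
# guarantees: the same terms in the same order.
saved = json.dumps(vocabulary)
reloaded = json.loads(saved)

# Because the term-to-index mapping is identical, a brand-new document is
# encoded with the same indices as during training.
index = {term: i for i, term in enumerate(reloaded)}
doc = "c b a b a".split(" ")
vector = [0.0] * len(reloaded)
for term in doc:
    if term in index:  # unseen terms would be ignored
        vector[index[term]] += 1.0
print(vector)
# [2.0, 2.0, 1.0]
```

This is why loading the saved CountVectorizerModel, rather than refitting on new data, is the right way to vectorize new elements for an existing classifier.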
For more information, please refer to the documentation on saving and loading spark-ml models/pipelines.
The model-creation code example is taken from the official documentation.
Upvotes: 2