The ML.NET documentation shows how to use context.Transforms.Text.ProduceWordBags to get word bags. The method takes Transforms.Text.NgramExtractingEstimator.WeightingCriteria as one of its parameters, so it's possible to request that TfIdf weights be used. The simplest example would be:
// Get a small dataset as an IEnumerable and load it as an ML.NET data set.
var ml = new MLContext();
IEnumerable<SamplesUtils.DatasetUtils.SampleTopicsData> data = SamplesUtils.DatasetUtils.GetTopicsData();
var trainData = ml.Data.LoadFromEnumerable(data);
// Build 1-gram word bags from the "Review" column, requesting TF-IDF weighting.
var pipeline = ml.Transforms.Text.ProduceWordBags("bags", "Review", ngramLength: 1, weighting: Transforms.Text.NgramExtractingEstimator.WeightingCriteria.TfIdf);
var transformer = pipeline.Fit(trainData);
var transformed_data = transformer.Transform(trainData);
That's all fine, but how do I get the actual results out of transformed_data?
I did some digging in a debugger, but I'm still quite confused about what's actually happening here.
First of all, running the pipeline adds three extra columns to transformed_data, all named bags.
After getting a preview of the data I can see what's in these columns. To make things clearer, here's what GetTopicsData returns, which is what we're running our transform on:
animals birds cats dogs fish horse
horse birds house fish duck cats
car truck driver bus pickup
car truck driver bus pickup horse
That's exactly what I'm seeing in the very first bags column, typed as Vector<string>.
Moving on to the second bags column, typed as Vector<Key<UInt32, 0-12>> (no idea what 0-12 is here, btw). This one has a KeyValues annotation on it, and it looks like for each row it maps the words to indexes in a global Vocabulary array. The Vocabulary array is part of Annotations.
So that's promising. You'd think the last bags column, typed as Vector<Single, 13>, would have the weights for each of the words! Unfortunately, that's not what I'm seeing. First of all, the same Vocabulary array is present in Annotations.
And the values in the rows are 1/0, which is not what TfIdf should return.
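For comparison, here is a rough hand computation of what I'd expect, assuming the textbook formula weight(w, d) = count(w in d) * ln(N / df(w)). That formula is my assumption for illustration only; ML.NET's exact scaling may differ. The class name TfIdfSketch is made up for this post, and it uses no ML.NET calls at all.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration; not part of ML.NET.
public static class TfIdfSketch
{
    // Assumed textbook TF-IDF: term count in the document times ln(N / document frequency).
    public static (string[] Vocabulary, double[][] Weights) Compute(string[] docs)
    {
        string[][] tokens = docs
            .Select(d => d.Split(' ', StringSplitOptions.RemoveEmptyEntries))
            .ToArray();

        // Distinct words; LINQ to Objects yields them in order of first appearance in practice.
        string[] vocabulary = tokens.SelectMany(t => t).Distinct().ToArray();

        // Document frequency: number of documents that contain each word.
        Dictionary<string, int> docFreq = vocabulary.ToDictionary(
            w => w,
            w => tokens.Count(t => t.Contains(w)));

        // One weight vector per document, one slot per vocabulary word.
        double[][] weights = tokens
            .Select(t => vocabulary
                .Select(w => t.Count(x => x == w) * Math.Log((double)docs.Length / docFreq[w]))
                .ToArray())
            .ToArray();

        return (vocabulary, weights);
    }

    public static void Main()
    {
        // The four rows that GetTopicsData returns, as listed in the question.
        var docs = new[]
        {
            "animals birds cats dogs fish horse",
            "horse birds house fish duck cats",
            "car truck driver bus pickup",
            "car truck driver bus pickup horse",
        };

        var (vocabulary, weights) = Compute(docs);
        foreach (double[] row in weights)
        {
            Console.WriteLine(string.Join(" ", vocabulary.Zip(row, (w, v) => $"{w}={v:F3}")));
        }
    }
}
```

Under that assumed formula, animals (present in 1 of 4 rows) would weigh ln 4 ≈ 1.386 in the first row, while horse (present in 3 of 4 rows) would weigh ln(4/3) ≈ 0.288, so a plain 1/0 vector is not what I'd expect.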
So to me that looks more like "Is word i from the Vocabulary present in the current row" and not its TfIdf weight, which is what I'm trying to get.
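In case it helps anyone reproduce this outside the debugger, here is a minimal sketch of how I'd dump the visible bags column next to its vocabulary. It assumes the Microsoft.ML package, that the output column carries slot names (which I take to be the Vocabulary array seen in the debugger), and that looking a column up by name returns the last column with that name. Doc and DumpBags are hypothetical names; Doc stands in for SampleTopicsData so the snippet does not need the samples assembly.

```csharp
using System;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Transforms.Text;

// Hypothetical stand-in for SampleTopicsData; only the text column is needed here.
public class Doc
{
    public string Review { get; set; }
}

// Hypothetical helper for illustration.
public static class DumpBags
{
    public static (string[] Vocabulary, float[][] Rows) Run()
    {
        var ml = new MLContext();
        var docs = new[]
        {
            new Doc { Review = "animals birds cats dogs fish horse" },
            new Doc { Review = "horse birds house fish duck cats" },
            new Doc { Review = "car truck driver bus pickup" },
            new Doc { Review = "car truck driver bus pickup horse" },
        };
        IDataView trainData = ml.Data.LoadFromEnumerable(docs);

        var pipeline = ml.Transforms.Text.ProduceWordBags(
            "bags", "Review", ngramLength: 1,
            weighting: NgramExtractingEstimator.WeightingCriteria.TfIdf);
        IDataView transformed = pipeline.Fit(trainData).Transform(trainData);

        // Assumption: the by-name lookup returns the last "bags" column,
        // i.e. the visible Vector<Single> one rather than the hidden intermediates.
        DataViewSchema.Column bags = transformed.Schema["bags"];

        // Assumption: slot names are present and label each vector position with its word.
        VBuffer<ReadOnlyMemory<char>> slotNames = default;
        bags.GetSlotNames(ref slotNames);
        string[] vocabulary = slotNames.DenseValues().Select(s => s.ToString()).ToArray();

        // Materialize every row of the column as a dense float array.
        float[][] rows = transformed.GetColumn<float[]>(bags).ToArray();
        return (vocabulary, rows);
    }

    public static void Main()
    {
        var (vocabulary, rows) = Run();
        foreach (float[] row in rows)
        {
            Console.WriteLine(string.Join(" ", vocabulary.Zip(row, (w, v) => $"{w}={v}")));
        }
    }
}
```

This only prints whatever the transform produced; it does not change the weighting, so it is a way to inspect the values rather than a fix.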