Reputation: 13833
So I am trying to classify documents based on their text with Naive Bayes. Each document might belong to 1 to n categories (think of them as tags on a blog post).
My current approach is to provide R with a CSV looking like this:
+-------------------------+---------+-------+-------+
| TEXT TO CLASSIFY | Tag 1 | Tag 2 | Tag 3 |
+-------------------------+---------+-------+-------+
| Some text goes here | Yes | No | No |
+-------------------------+---------+-------+-------+
| Some other text here | No | Yes | Yes |
+-------------------------+---------+-------+-------+
| More text goes here | Yes | No | Yes |
+-------------------------+---------+-------+-------+
Of course the desired behaviour is to have an input looking like
Some new text to classify
And an output like
+------+------+-------+
| Tag 1| Tag 2| Tag 3 |
+------+------+-------+
| 0.12 | 0.75 | 0.65 |
+------+------+-------+
And then based on a certain threshold, determine whether or not the given text belongs to tags 1, 2, 3.
Now the question is, in the tutorials I have found, it looks like the input should be more like
+--------------------------+---------+
| TEXT TO CLASSIFY | Class |
+--------------------------+---------+
| Some other text here | No |
+--------------------------+---------+
| Some other text here | Yes |
+--------------------------+---------+
| Some other text here | Yes |
+--------------------------+---------+
That is, a row per text per class. Using that, yes, I can train Naive Bayes and then use one-vs-all to determine which texts belong to which tags. The question is: can I do this in a more elegant way (that is, with the training data looking like the first example I mentioned)?
One of the examples I found is http://blog.thedigitalgroup.com/rajendras/2015/05/28/supervised-learning-for-text-classification/
Upvotes: 2
Views: 3568
Reputation: 4101
There are conceptually two approaches.
As always in probabilistic modelling, the question is whether you assume that your tags are independent or not. In the spirit of Naive Bayes, the independence assumption would be very natural. In that case, approach 2 would be the way to go. If the independence assumption is not justified and you are afraid of the combinatorial explosion, you can use a standard Bayesian network. If you keep certain assumptions, your performance will not be impacted.
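To make the independence assumption concrete, it could be written as follows. The notation is an assumption introduced here, not taken from the question: $d$ is the document, $w_i$ are its words, and $t_k \in \{\text{Yes}, \text{No}\}$ says whether tag $k$ applies.

```latex
P(t_1, \dots, t_n \mid d) = \prod_{k=1}^{n} P(t_k \mid d),
\qquad
P(t_k \mid d) \propto P(t_k) \prod_{i} P(w_i \mid t_k)
```

Read this way, each tag gets its own binary Naive Bayes model, and the per-tag probabilities need not sum to 1 across tags, which matches the kind of output the question asks for.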
However, you could also take a mixed approach.
http://link.springer.com/article/10.1007%2Fs10994-006-6136-2#/page-1
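If the tags are treated as independent, the wide table from the question (one row per text, one Yes/No column per tag) can probably be used as is: loop over the tag columns and fit one binary classifier per column. Below is a minimal sketch of that idea, written in plain Python rather than R so that it has no package dependencies. It is a hand-rolled multinomial Naive Bayes with Laplace smoothing, not any library's API, and the names `train_tag`, `predict_tag`, `train_all` and `classify`, the whitespace tokenizer, and the 0.5 threshold are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer (an assumption; real text needs more care).
    return text.lower().split()


def train_tag(texts, labels, alpha=1.0):
    """Fit a binary multinomial Naive Bayes model for a single tag column."""
    vocab = {w for t in texts for w in tokenize(t)}
    counts = {True: Counter(), False: Counter()}
    n_docs = {True: 0, False: 0}
    for text, label in zip(texts, labels):
        counts[label].update(tokenize(text))
        n_docs[label] += 1
    model = {"vocab": vocab, "alpha": alpha, "classes": {}}
    for c in (True, False):
        model["classes"][c] = {
            # Smoothed prior, so a tag with no positive rows does not hit log(0).
            "log_prior": math.log((n_docs[c] + alpha) / (len(texts) + 2 * alpha)),
            "counts": counts[c],
            "denom": sum(counts[c].values()) + alpha * len(vocab),
        }
    return model


def predict_tag(model, text):
    """Return the model's estimate of P(tag = Yes | text)."""
    scores = {}
    for c, params in model["classes"].items():
        score = params["log_prior"]
        for w in tokenize(text):
            if w in model["vocab"]:  # skip words never seen in training
                score += math.log(
                    (params["counts"][w] + model["alpha"]) / params["denom"]
                )
        scores[c] = score
    top = max(scores.values())  # subtract the max before exponentiating
    exp = {c: math.exp(s - top) for c, s in scores.items()}
    return exp[True] / (exp[True] + exp[False])


def train_all(rows, tags):
    """rows: list of (text, {tag: bool}) pairs, i.e. the wide table."""
    texts = [text for text, _ in rows]
    return {
        tag: train_tag(texts, [labels[tag] for _, labels in rows])
        for tag in tags
    }


def classify(models, text, threshold=0.5):
    """Return per-tag probabilities and the tags at or above the threshold."""
    probs = {tag: predict_tag(m, text) for tag, m in models.items()}
    assigned = [tag for tag, p in probs.items() if p >= threshold]
    return probs, assigned


# The three example rows from the question's first table.
TAGS = ["Tag 1", "Tag 2", "Tag 3"]
ROWS = [
    ("Some text goes here", {"Tag 1": True, "Tag 2": False, "Tag 3": False}),
    ("Some other text here", {"Tag 1": False, "Tag 2": True, "Tag 3": True}),
    ("More text goes here", {"Tag 1": True, "Tag 2": False, "Tag 3": True}),
]

models = train_all(ROWS, TAGS)
probs, assigned = classify(models, "Some other text here")
print(probs)
print(assigned)
```

The same loop-over-columns structure should carry over to R with whatever Naive Bayes implementation you already use: keep the wide data frame and fit one model per tag column instead of reshaping to one row per text per class.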
Upvotes: 1