Implementation advice on semi-supervised automated tagging

Question

I'm wondering what approaches exist for developing an automated tagging system. I'm building a company-internal feedback platform, and our business users want to add tags to the posts. I'd like to build a system that suggests tags to users as they post, lets the user correct the suggestions, and learns from those corrections. We have a couple of tags that we want to use initially, but we want to allow users to add more as necessary.

I'm aware of the LDA algorithm and Kea/Mallet, but these seem like incomplete solutions. I'd like to add our predefined tags to the existing posts, and then have those as a guide for the system moving forward.

Just looking for some advice on how to proceed. One problem is the dataset is currently very small (~90 posts).

Thanks!

Robotijn · Accepted Answer

I wrote a PhD thesis on exactly this problem, which I called Generative AI. Since you are probably not going to read the thesis, here is the general algorithm for this kind of problem:

1) Normalize the data: if you have numbers, scale them to the range 0 to 1 (or -1 to 1); if you have words/names, use only lowercase (or only uppercase); if you have both, split the data into numbers and other labels and make it a multiple classifier system.
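As a rough illustration of step 1, here is a minimal sketch in plain Python. The function names `min_max_scale` and `normalize_text` are made up for this example, and collapsing whitespace is an extra assumption that goes slightly beyond the lowercase advice above.

```python
def min_max_scale(values, low=0.0, high=1.0):
    """Rescale a list of numbers linearly into the range [low, high]."""
    smallest, largest = min(values), max(values)
    if largest == smallest:
        # A constant column carries no information; map it to the midpoint.
        return [(low + high) / 2.0 for _ in values]
    span = largest - smallest
    return [low + (v - smallest) * (high - low) / span for v in values]


def normalize_text(text):
    """Lowercase the text and collapse runs of whitespace (assumed extra step)."""
    return " ".join(text.lower().split())
```

For example, `min_max_scale([2, 4, 6])` gives `[0.0, 0.5, 1.0]`, and passing `low=-1.0, high=1.0` maps the same values onto -1 to 1.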

2) Use KNN (K-nearest neighbors) until the categories become large enough (typically you can use KNN for the first few hundred items in a class/category). Try different settings to optimize the results: play with K (I typically use the range 1 to 21, always odd numbers) and with the distance function. SciPy has decent implementations that are easy to use.

Also, use the rank of each neighbor to influence the decision. For example, with a K of 11, every label on the nearest item gets 11 points, every label on the 2nd nearest item gets 10 points, and so on. Then add up the points per label and show the best N label(s) according to their totals.

Then show the label(s) to the user so the user can give feedback and the system can update itself. The advantage of showing more labels is that the user has to type less.
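The rank-weighted vote in step 2 could look roughly like the sketch below. It is only an illustration: the Jaccard distance on token sets is an assumed choice of distance function, and `suggest_tags` and the `(token_set, tag_set)` data layout are invented for this example.

```python
from collections import defaultdict


def jaccard_distance(a, b):
    """Distance between two token sets: 1 - |A and B| / |A or B|."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def suggest_tags(post_tokens, labelled_posts, k=11, n=3):
    """Return the n best tags using a rank-weighted KNN vote.

    labelled_posts is a list of (token_set, tag_set) pairs (assumed layout).
    The nearest neighbour gives k points to each of its tags, the second
    nearest gives k - 1 points, and so on.
    """
    neighbours = sorted(
        labelled_posts,
        key=lambda item: jaccard_distance(post_tokens, item[0]),
    )[:k]
    scores = defaultdict(int)
    for rank, (_, tags) in enumerate(neighbours):
        for tag in tags:
            scores[tag] += k - rank
    # Highest score first; ties are broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [tag for tag, _ in ranked[:n]]
```

With KNN, "learning" from feedback amounts to appending the post and the tags the user finally kept to `labelled_posts`, so no retraining step is needed.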

3) Once you have enough items, replace the KNN algorithm with support vector machines. Linear support vector machines are often good enough. To optimize (linear) support vector machines, use a grid search over the parameters.
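One possible way to set up step 3 is sketched below with scikit-learn, which the answer does not name, so treat the library choice as an assumption. The TF-IDF features, the one-vs-rest wrapper for multiple tags per post, the grid of `C` values, and the helper name `fit_tagger` are likewise illustrative choices.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC


def fit_tagger(texts, tag_lists, cv=3):
    """Grid-search a linear SVM tagger; returns (fitted_search, binarizer)."""
    # Turn lists of tags into a 0/1 matrix with one column per tag.
    binarizer = MultiLabelBinarizer()
    y = binarizer.fit_transform(tag_lists)

    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer(lowercase=True)),
        # One linear SVM per tag, so a post can receive several tags.
        ("svm", OneVsRestClassifier(LinearSVC(random_state=0))),
    ])
    search = GridSearchCV(
        pipeline,
        # Assumed grid: only the regularisation strength C is tuned here.
        param_grid={"svm__estimator__C": [0.01, 0.1, 1, 10, 100]},
        scoring="f1_micro",
        cv=cv,
    )
    search.fit(texts, y)
    return search, binarizer
```

After fitting, `binarizer.inverse_transform(search.predict(new_texts))` turns the predicted 0/1 rows back into tag names.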

The basic idea is that you have a system that is generating hypotheses (the labels in this case) and that the user is giving feedback, often in a production system, so that the AI can optimize itself.
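The hypothesize-and-correct loop described above might be wired together as in this sketch. `FeedbackTagger` and `most_common_tag` are hypothetical names; `most_common_tag` is only a trivial stand-in for whichever suggester (KNN or SVM) is in use.

```python
class FeedbackTagger:
    """Suggest tags, accept corrections, and keep the corrected examples."""

    def __init__(self, suggest_fn):
        # suggest_fn(post, examples) -> list of tags (hypothetical plug-in point).
        self.suggest_fn = suggest_fn
        self.examples = []       # (post, frozenset_of_tags) pairs
        self.known_tags = set()  # grows as users introduce new tags

    def suggest(self, post):
        """Generate the hypothesis: tags proposed for a new post."""
        return self.suggest_fn(post, self.examples)

    def record_feedback(self, post, final_tags):
        """Store the tags the user actually kept, including brand-new ones."""
        tags = frozenset(t.strip().lower() for t in final_tags if t.strip())
        self.examples.append((post, tags))
        self.known_tags |= tags


def most_common_tag(post, examples):
    """Stand-in suggester: propose the single most frequent tag seen so far."""
    counts = {}
    for _, tags in examples:
        for tag in tags:
            counts[tag] = counts.get(tag, 0) + 1
    return sorted(counts, key=lambda t: (-counts[t], t))[:1]
```

Each call to `record_feedback` enlarges the example set, so the next `suggest` call already reflects the user's correction.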

If you are very interested, here is my PhD thesis:

https://irs.ub.rug.nl/dbi/4c86122381f79

At the moment I use it for robots that learn in real-time...
