How can I programmatically measure the vagueness of text?

Question

I want to provide a service that finds job postings on other sites and lets users painlessly apply for those jobs.

What I would like to provide is a form of automated screening for postings. Specifically, I'd like to add an option to filter out postings with vague language, in case a user doesn't want job postings from 3rd-party recruiters (since vague language is a tell-tale sign of that kind of posting).

Is there an algorithm which I can use to measure the vagueness or clarity-level of some text?

Nikita Astrakhantsev · Accepted Answer

As I understand it, you need a classifier that sorts job descriptions into 2 classes: "3rd parties" and "employers themselves". This is a classic text classification task, very similar to spam filtering.

The main differences from spam filtering are the following:

The boundary between the classes is vague: even a human often can't determine the source of a job description.
There is almost no counteraction from the authors of job descriptions, i.e. they rarely try to evade the filter the way spammers do.

So, I recommend using a supervised machine learning approach for your task. Create a training set of job descriptions: it is not so hard to collect 100-200 of each type, and I guess that would be enough. Then try ML classifiers like Random Forest, Logistic Regression or Naive Bayes with simple features such as bag-of-words, the name of the person who uploaded the job description, and the length of the text. Also try some binary features, e.g. the presence of special words like the ones recommended by @Sklivvz♦.
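The feature ideas above could be sketched roughly like this, using only the Python standard library. This is just an illustration: `extract_features` is a hypothetical helper, and the words in `MARKER_WORDS` are made-up placeholders, not the list suggested by @Sklivvz♦.

```python
import re
from collections import Counter

# Hypothetical marker words (an assumption for illustration only);
# replace them with words that actually separate the classes in your data.
MARKER_WORDS = {"client", "leading", "dynamic"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def extract_features(text):
    """Return a flat dict of simple features for one job description."""
    tokens = tokenize(text)
    features = {}
    # Bag-of-words: one count feature per distinct token.
    for word, count in Counter(tokens).items():
        features["bow:" + word] = count
    # Length of the text, measured in tokens.
    features["length"] = len(tokens)
    # Binary features: 1 if a marker word occurs in the text, else 0.
    for word in MARKER_WORDS:
        features["has:" + word] = int(word in tokens)
    return features


example = extract_features("Our client is a leading firm. Our client hires.")
```

A dict like this can be fed to most classifiers after converting it to a fixed-length vector. The uploader's name would be one more categorical feature, if the site exposes it.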

For an example, look at Naive Bayes spam filtering.
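To make that concrete, here is a minimal sketch of a multinomial Naive Bayes classifier over bag-of-words, in the style used for spam filtering. Everything in it is an assumption for illustration: the `NaiveBayes` class is hand-rolled rather than taken from a library, and the four training texts are invented toy data, far fewer than the 100-200 per class you would really want.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()                       # all words seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.doc_counts[label] += 1
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = [t for t in tokenize(text) if t in self.vocab]  # skip unseen words
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # Start from the log prior of the class.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            denominator = total_words + len(self.vocab)
            # Add the smoothed log likelihood of each token.
            for token in tokens:
                score += math.log((self.word_counts[label][token] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy training data (an assumption, not real postings).
train_texts = [
    "Our client, a leading company, seeks a dynamic professional",
    "Exciting opportunity with our client in a fast growing sector",
    "We are hiring a backend engineer to work on our payments API in Python",
    "Join our team to maintain the PostgreSQL cluster behind our search product",
]
train_labels = ["recruiter", "recruiter", "employer", "employer"]

model = NaiveBayes().fit(train_texts, train_labels)
prediction = model.predict("Our client seeks a dynamic professional")
```

Working in log space avoids numeric underflow on long texts, and add-one smoothing keeps a word that never appeared in one class from zeroing out that class entirely.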

Your classes ("vague text" and "clear text") seem too vague themselves to build an effective classifier on. In addition, your assumption that this classification is equivalent to the one I formulated above (which is what you really need) doesn't look reliable.
