Ten Bitcomb

Reputation: 2374

How can I programmatically measure the vagueness of text?

I want to provide a service that finds job postings on other sites and lets users painlessly apply for those jobs.

What I would like to provide is a form of automated screening for postings. Specifically, I'd like to add an option to filter out postings with vague language, in case a user doesn't want job postings from third-party recruiters (vague language is a telltale sign of that kind of posting).

Is there an algorithm I can use to measure the vagueness or clarity of a piece of text?

Upvotes: 2

Views: 677

Answers (2)

Nikita Astrakhantsev

Reputation: 4749

As I understand it, you need a classifier that sorts job descriptions into two classes: "3rd parties" and "employers themselves". This is a classic text classification task, very similar to spam filtering.

The main differences from spam filtering are the following:

  1. The boundary between the classes is vague: even a human often can't determine the source of a job description.
  2. There is almost no counteraction from the authors of job descriptions, who, unlike spammers, rarely try to evade the filter.

So I recommend a supervised machine learning approach for your task. Create a training set of job descriptions; it should not be hard to collect 100-200 of each type, and I would guess that is enough. Then try ML classifiers such as random forest, logistic regression, or Naive Bayes with simple features such as bag-of-words, the name of the person who uploaded the job description, and the length of the text. Also try some binary features, e.g. the presence of special words like the ones recommended by @Sklivvz♦.

See Naive Bayes spam filtering for an example.
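To make the recommendation concrete, here is a minimal sketch of a bag-of-words Naive Bayes classifier in plain Python. It is an illustration under assumptions, not a tuned implementation: the `NaiveBayes` class, the `tokenize` helper, the labels "recruiter" and "employer", and the four training sentences are all invented for this example, and a real training set would need the 100-200 postings per class mentioned above. It uses word counts only; features such as text length or the uploader's name would have to be added separately.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens (bag-of-words)."""
    return re.findall(r"[a-z0-9#+]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.doc_counts = Counter(labels)        # documents per class
        self.word_counts = defaultdict(Counter)  # word frequencies per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def log_scores(self, text):
        """Return the unnormalized log-probability of each class."""
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # class prior
            for word in tokenize(text):
                if word in self.vocab:             # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Invented toy examples, only to show the mechanics.
train = [
    ("Our client, a world class market leader, seeks a dynamic rockstar", "recruiter"),
    ("Exciting opportunity with our client for a results-driven self-starter", "recruiter"),
    ("We are hiring a backend engineer to maintain our Postgres billing service", "employer"),
    ("Join our team of six to build the Android app in Kotlin", "employer"),
]
texts, labels = zip(*train)
model = NaiveBayes().fit(texts, labels)

print(model.predict("Our client seeks a dynamic self-starter"))                 # recruiter
print(model.predict("We are hiring an engineer to build our billing service"))  # employer
```

With this little data the predictions only demonstrate the mechanics. On a real training set, hold out part of the labeled postings to check how often the classifier is wrong before using it as a filter.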

Your classes ("vague text" and "clear text") seem too vague to build an effective classifier on. In addition, your assumption that this classification is equivalent to the one I formulated above (which is the one you actually need) doesn't look reliable.

Upvotes: 3

Sklivvz

Reputation: 31143

I wrote something similar, though not exactly what you are asking for, on my website for Careers Stack Overflow.

Some phrases commonly indicate a vague job ad: corporate jargon. While it's pretty hard to determine whether a single word or phrase is actually being used in a jargon-y way, it quickly becomes evident that bad postings tend to have many matches, because they use many such words.

You can test the tool here, and there are more explanations on the site.

Regarding the code, it's simply a series of static compiled regexes. Simple and works for my needs.

// LINQPad-style "C# Program" snippet. Outside LINQPad, wrap it in a class and add:
// using System; using System.Collections.Generic; using System.Linq;
// using System.Text.RegularExpressions;
void Main()
{
    string test = "developer-centric vision of insourcing";

    // Run every pattern against the text and collect the lowercased matches.
    var matches = BadChecks.SelectMany(bad =>
        bad.Matches(test)
           .Cast<Match>()
           .Select(m => m.Value.ToLowerInvariant())
        ).ToList();

    foreach (var res in matches)
        Console.WriteLine(res);
}

// The patterns are compiled once and reused for every check.
private static readonly List<Regex> BadChecks = SetupBadChecks();

private static List<Regex> SetupBadChecks() {
    return new List<string> {
        "(#1|number (one|1))",
        "([a-z]+)-free",
        "(Out|in)sourcing",
        "-centric",
        "a wider net",
        "Aggregator",
        "Alignment",
        "all hands on deck",
        //  more
        "Wellness",
        "Win(-| )win",
        "World(-| )class"
    }.Select(s => new Regex(s, RegexOptions.IgnoreCase |
                               RegexOptions.CultureInvariant |
                               RegexOptions.Compiled))
     .ToList();
}

For the test string above, this prints:

insourcing
-centric

Upvotes: 2
