user3276768

Reputation: 1436

Spark: Word classification

I have a question about word classification in Spark. I am working on a simple classification model that takes a word (a single word) as input and predicts the race of the named person (the names come from a fictitious universe). For example, Gimli -> dwarf, Legolas -> elf.

My issue is how to process the words. I know that Spark includes two feature vectorization methods, tf–idf and word2vec. However, I am having difficulty understanding them and do not know which one to use.

Could anyone explain them to me and guide me through the process? More importantly, I would like to know which of these methods is the most appropriate for this case.

Thanks

Upvotes: 1

Views: 991

Answers (2)

Adam Bittlingmayer

Reputation: 1277

Firstly, we should be clear that the correct approach will depend on the data. *

This task is called language detection or language identification. Even for entire sentences or pages, vectors built from entire words are not the right approach. (That would only work on names you have actually encountered in training - essentially a lookup list, with no real prediction.) Rather, you need an n-gram model based on characters. For example, in a bigram model:
"Gimli" --> "_G Gi im ml li i_"

Unfortunately you cannot use pyspark.ml.feature.NGram for this out of the box, because it assumes a gram is a word, not a character.

What to do?

You must first find or write a function to do this transform to character n-grams, and apply it to both the original names and to the queries that come into your system (a sketch follows after the NGram note below). (If names contain spaces, treat the space as a character too.)

Then, in Spark terminology, these character n-grams are your "words", and the string containing all of them (e.g. "_G Gi im ml li i_") is your "document".

(And, if you like, you can now use NGram: splitting "Gimli" into ['G', 'i', 'm', 'l', 'i'] and then applying NGram with n=2 is roughly equivalent to splitting into ['_G', 'Gi', 'im', ...], minus the boundary grams unless you pad first.)
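
Here is one possible sketch of such a transform in PySpark. The '_' padding and the column names are my own choices, not anything Spark prescribes:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, StringType

    spark = SparkSession.builder.getOrCreate()

    def char_ngrams(name, n=2):
        # Pad with '_' so word boundaries show up as features,
        # e.g. "Gimli" -> ['_G', 'Gi', 'im', 'ml', 'li', 'i_']
        padded = "_" + name.replace(" ", "_") + "_"
        return [padded[i:i + n] for i in range(len(padded) - n + 1)]

    char_ngrams_udf = udf(char_ngrams, ArrayType(StringType()))

    # Toy training data from the question
    df = spark.createDataFrame([("Gimli", "dwarf"), ("Legolas", "elf")],
                               ["name", "race"])
    df = df.withColumn("grams", char_ngrams_udf("name"))

Apply the same UDF to incoming query names before scoring them.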

Once you frame it that way, it is a flavour of the standard document classification problem (actually "regression" in strict Spark terminology), for which Spark has a few options. The main thing to note is that order is important, so do not use approaches that treat the input as a pure bag of words. Although all of the Spark classification examples to be found vectorise with TF-IDF (and it will not completely fail in your case), it will be suboptimal, because I assume the order/context of each character n-gram actually matters.
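
For completeness, a minimal TF-IDF pipeline sketch on top of the 'grams' column from above - with the caveat just mentioned that TF-IDF discards n-gram order. NaiveBayes here is just one plausible classifier choice:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import HashingTF, IDF, StringIndexer
    from pyspark.ml.classification import NaiveBayes

    # Assumes the DataFrame df with 'grams' and 'race' columns from above
    indexer = StringIndexer(inputCol="race", outputCol="label")
    tf = HashingTF(inputCol="grams", outputCol="rawFeatures")
    idf = IDF(inputCol="rawFeatures", outputCol="features")
    nb = NaiveBayes(featuresCol="features", labelCol="label")

    model = Pipeline(stages=[indexer, tf, idf, nb]).fit(df)
    predictions = model.transform(df)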

As far as optimising for accuracy, there are possible refinements around alphabets, special characters, case sensitivity, stemming etc. It depends on your data - see below. (It would be interesting if you posted a link to the entire dataset.)

* Regarding the data and assumptions about it: the character n-gram approach works well for identifying actual human languages from planet Earth. Even for human languages, there are special cases for classes of text like names - for example, Chinese characters could be used, or languages like Haitian Creole or Tagalog where many of the names are just French or Spanish, or Persian or Urdu where they are just Arabic - pronounced differently but spelt the same.

We know the basic problems and techniques for words from major human languages but, for all we know, the names in your data:
- are in random or mixed alphabets
- contain special characters like '/' or '_', normally more likely to be seen in URLs
- are numbers

Likewise interesting is the question of how the names correlate to group membership. For example, it could be that the names are randomly generated from alphabetic chars, or are simply a list of English names, or are generated by any other approach and then randomly assigned to class A or B. In that case it is not possible to predict whether names yet unseen are members of A or B. It is also possible that As are named for the day of the week on which they were born, but Bs for the day of the week on which they were conceived. In that case prediction is possible, but not without more information.

In another scenario, again the same generator is used, but names are assigned to A or B based on:
- length, i.e. number (< or >= some cutoff) of chars/bytes/vowels/uppercase letters
- length, i.e. number (even or odd) of ...

In these cases a completely different set of features must be extracted.

In yet another scenario, the names of Bs are always repeated blocks, like 'johnjohn'. In this case character n-gram frequencies can work better than random guessing, but are not the optimal approach.

So you will always need some intuition about the problem. It is difficult for us to make assumptions about an artificial world; from the two examples you have given, we might assume the names are somewhat English-ish. And finally you must try different approaches and features (ideally, whatever classifier you choose simply ignores useless signals). At least in the real world, features like word count, char count and byte count are actually useful for this problem - they can augment the character n-gram approach, as the sketch below shows.
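
A rough sketch of that augmentation, assuming the 'predictions' DataFrame from the pipeline above (the column names are again my own):

    from pyspark.ml.feature import VectorAssembler
    from pyspark.sql.functions import length

    # Add a simple char-count feature alongside the TF-IDF vector
    with_counts = predictions.withColumn("char_count",
                                         length("name").cast("double"))

    assembler = VectorAssembler(inputCols=["features", "char_count"],
                                outputCol="augmented_features")
    augmented = assembler.transform(with_counts)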

Upvotes: 2

Ajay Gupta

Reputation: 3212

No model can predict the race of a species from just the name.
You can create a lookup dictionary of all known characters and their races using Wikipedia or DBpedia, then pass each name to a function and get the race.
If the data is huge and you want to do it in less time, you can use a join, as sketched below.
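
A minimal sketch of that join in PySpark (the tiny lookup table here just stands in for one scraped from Wikipedia/DBpedia):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical lookup table built from Wikipedia/DBpedia
    lookup = spark.createDataFrame([("Gimli", "dwarf"), ("Legolas", "elf")],
                                   ["name", "race"])
    names = spark.createDataFrame([("Gimli",), ("Frodo",)], ["name"])

    # Broadcasting the small lookup table avoids shuffling the large side
    result = names.join(broadcast(lookup), on="name", how="left")
    result.show()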

Upvotes: 1
