kayas

Reputation: 723

Create an N-gram model for a custom vocabulary

I want to create an N-Gram model that will not work with "English words" but with a custom vocabulary. My vocabulary list looks like this:

vocabs = [ [0.364, 0.227, 0.376], [0.875, 0.785, 0.376], ........ ]

What I am trying to say is that each element in my vocabs list needs to be considered a "word" by the N-Gram model. My training dataset will have numbers in exactly the same format as my vocabs list, like below:

training_data = [ [0.344, 0.219, 0.374], [0.846, 0.776, 0.376], ........ ]

Note: In the example I wanted to show that the training "words" (lists of 3 numbers) are not exactly the same as the "words" in my vocabulary, but they will be very close.

Now, my question is: can I build an N-Gram model that can be trained using this training data, and later use that model to predict the probability of a new "word" as it arrives?

I am using Python and can find a lot of N-Gram examples using the "nltk" library. But the problem is that in most cases "English words" are used. As I am not very familiar with N-Grams, these examples confused me. I would be very happy if anyone could answer my questions and/or point out some tutorials for learning N-Grams in general (not specific to NLP).

Thanks.

Edit:

Just to simplify the question, I will try to explain it in a different way: I have a vocabulary like below:

vocabs = [v1, v2, v3, ........vn]

I also have two sequence generators (SGs). Both of them generate a sequence of words from my vocabulary.

My goal is to predict, from the streaming data, which generator is currently generating the sequence of words.

Now I want to build two N-gram models (one for each SG) using my labeled training data (I already have some labeled data from the SGs). Finally, I will feed my streaming data into both models and select the more probable SG by comparing the predictions of the two N-gram models. Just to be clear: if the N-gram model for SG1 gives a higher probability than the N-gram model for SG2, I will decide that the current streaming data is generated by SG1.
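
A rough sketch of the decision rule I have in mind (just an illustration; score_sequence and the two trained models are hypothetical placeholders, not code I already have):

def classify_window(window, ngram_sg1, ngram_sg2, score_sequence):
    #score the current window of streaming "words" under each SG's model
    p1 = score_sequence(window, ngram_sg1) #probability of the window under SG1's model
    p2 = score_sequence(window, ngram_sg2) #probability of the window under SG2's model
    #pick the generator whose model assigns the higher probability
    return 'SG1' if p1 >= p2 else 'SG2'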

I hope the explanation helps to clarify my concern. I really appreciate the effort to answer this question.

Note: If you know any other models that can solve this problem well (better than an N-gram model), please mention them.

Thanks.

Upvotes: 1

Views: 992

Answers (2)

dahrs

Reputation: 91

In that case, you can definitely use n-grams.

First let me make sure I understood:

You have a vocabulary:

vocabs = [v0, v1, v2, v3, ........vn]

and 2 sequence generators that take elements from your vocab and return sequences of vocab elements:

sg1 = [v0, v1, v2, v1, v3]
sg2 = [v2, v4, v6, v2, v8]

Now, if I understand correctly, you want to use n-grams to artificially replicate and augment your sg1 and sg2 outputs:

ngramSg1 = [v0, v1, v3, v0, v1, v2, v1, v2, ...]
ngramSg2 = [v2, v4, v6, v2, v8, v2, v6, v2, ...]

Then, you want to use an ML model to determine the origin of the n-gram output (either SG1 or SG2). Am I right? Am I getting close?

If it's like what I described, then you should be able to use the code I wrote in my previous answer or any n-gram library you want. NEVERTHELESS, if I understood correctly, your vocabulary is made of lists of numbers and not individual objects. If that's the case, then you probably won't find any library that can handle it; it's way too specific. You might have to code your own version of an n-gram-based sequence counter/generator.
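
As a minimal sketch of what I mean (my own illustration, not from any library): since Python lists are not hashable, you can turn each 3-number "word" into a tuple and then count n-grams over those tuples exactly as you would over characters or strings:

from collections import Counter

def count_ngrams(sequence, n=2):
    #convert each "word" (a list of numbers) to a tuple so it can be used as a dict key
    tokens = [tuple(item) for item in sequence]
    #slide a window of size n over the token sequence and count each n-gram
    return Counter(zip(*(tokens[i:] for i in range(n))))

seq = [[0.344, 0.219, 0.374], [0.846, 0.776, 0.376], [0.344, 0.219, 0.374]]
print(count_ngrams(seq, n=2)) #each key is a pair of consecutive "words", each value is its count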

However, your case looks somewhat similar to word embeddings (which are basically language processing techniques that use vectors as a representation of words). If you don't already know about them, you might want to check out gensim's word2vec, doc2vec or fastText and either adopt them or adapt them.
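
For instance, a toy gensim word2vec example could look like the snippet below (assuming gensim >= 4.0, where the parameter is called vector_size; older versions call it size), just to show the kind of API I mean:

from gensim.models import Word2Vec

#each "sentence" is a sequence of tokens; word2vec learns one vector per token
sentences = [['v0', 'v1', 'v2', 'v1', 'v3'],
             ['v2', 'v4', 'v6', 'v2', 'v8']]
model = Word2Vec(sentences, vector_size=8, window=2, min_count=1)
print(model.wv['v1'])              #the learned 8-dimensional vector for 'v1'
print(model.wv.most_similar('v1')) #tokens whose vectors are closest to 'v1'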

Upvotes: 1

dahrs

Reputation: 91

Ok, I'm not sure of what you want to do exactly. But let's try anyway.

First, how do N-grams work: N-grams are quite simple predictors of the probability of a sequence. Since sentences are just sequences of words and words are just sequences of characters, they generally work great for strings:

Problem: you have a list of letters and you want to find out what would be the next letter in the sequence.

letterSequence = ['a', 'b', None] #the third element is the one we want to predict

If you have a bunch of letters in sequence, you can take note of which sequences occur:

training_data = ['a', 'b', 'c',
                 'a', 'b', 'c',
                 'a', 'b', 'd',
                 'a', 'b', 'f',
                 'b', 'c', 'd']

At first glance, you can see that the probability of having the sequence 'a','b','c' is twice as high as the probability of having 'a','b','d' or 'a','b','f'. What we're going to do is count how many times the same sequence appears in training_data and select the one that appears most often.

def makeNestedDict(aDict, listOfKeys):
    if len(listOfKeys) == 0: 
        if aDict != {}: return aDict
        return 0
    if listOfKeys[0] not in aDict:
        aDict[listOfKeys[0]] = {}
    aDict[listOfKeys[0]] = makeNestedDict(aDict[listOfKeys[0]], listOfKeys[1:])
    return aDict

def makeCoreferenceDict(ressource):
    #we'll use 3-grams but we could have chosen any n for n-grams
    ngramDict = {}
    index = 0
    #we make sure we won't go further than the length of the list
    while (index+2) < len(ressource):
        k1 = ressource[index]
        k2 = ressource[index+1]
        k3 = ressource[index+2]
        ngramDict = makeNestedDict(ngramDict, [k1, k2, k3])            
        ngramDict[k1][k2][k3] += 1 #counting
        index += 1
    return ngramDict

def predict(unkSequence, ngramDict):
    import operator
    corefDict = ngramDict[unkSequence[0]][unkSequence[1]]
    return max(corefDict.items(), key=operator.itemgetter(1))

############################################
ngramDict = makeCoreferenceDict(training_data)
#the most common letter that follows 'a', 'b' is... 
predict(letterSequence, ngramDict)
>>> ('c', 2) #... is 'c' and it appears twice in the data

You can also get a prediction score instead of getting the most common element by replacing the line (in the makeCoreferenceDict function):

ngramDict[k1][k2][k3] += 1 #counting

with:

ngramDict[k1][k2][k3] += 1.0/float(len(ressource)) #add to the score

so:

def showScore(unkSequence, ngramDict):
    return ngramDict[unkSequence[0]][unkSequence[1]]

############################################
ngramDict = makeCoreferenceDict(training_data)
#the scores of the letters that follow 'a', 'b' are...
showScore(letterSequence, ngramDict)
>>> {'c': 0.13333333333333333, 'd': 0.06666666666666667, 'f': 0.06666666666666667}

NOW, the n-gram method depends on having a finite set of elements (characters, words, natural numbers, etc.). In YOUR example, "vocabs" and "training_data" have barely any numbers in common. And I think that what you really need is a distance score between your words. I'm guessing that because of what you said:

In the example I wanted to show that, the training "words" (list of 3 number) are not exactly the same as the "words" in my vocabulary but they will be very close.

In that case it gets a little too complicated to show it here, but you might want to measure the distance between

each number of each element in "vocabs"

and

each number of each element in each sequence in "training_data"

and then compare them and choose the smallest score (a rough sketch of this idea follows below).
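
For example, a rough sketch of that idea (my own illustration, with Euclidean distance chosen arbitrarily) would snap every training "word" to the closest vocabulary entry, so that your sequences become sequences over a finite vocabulary again:

import math

def nearest_vocab_index(word, vocabs):
    #Euclidean distance between two 3-number "words"
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    #index of the vocab entry closest to the given word
    return min(range(len(vocabs)), key=lambda i: dist(word, vocabs[i]))

vocabs = [[0.364, 0.227, 0.376], [0.875, 0.785, 0.376]]
training_data = [[0.344, 0.219, 0.374], [0.846, 0.776, 0.376]]
print([nearest_vocab_index(w, vocabs) for w in training_data])
>>> [0, 1]

Once the "words" are mapped to vocab indices like this, the n-gram counting code above can be applied to the index sequence.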

If that's not the answer to your question, please reformulate or give us more examples. In any case, good luck with that.

Upvotes: 1
