hmghaly

Reputation: 1502

High Precision Word Alignment Algorithm in Python

I am working on a project for building a high-precision word alignment between sentences and their translations in other languages, for measuring translation quality. I am aware of Giza++ and other word-alignment tools that are used as part of the pipeline for Statistical Machine Translation, but this is not what I'm looking for. I'm looking for an algorithm that can map words from the source sentence to the corresponding words in the target sentence, transparently and accurately, given these restrictions:

Here is what I did:

Here is an example of a correlation matrix between an English and a German sentence. We can see the challenges discussed above.

[Image: alignment matrix between an English and a German sentence, showing the correlations between words; the green cells are the correct alignment points that the word-alignment algorithm should identify]

Here is some of what I tried, and the code I am using:

import random

src_words = ["I", "know", "this"]
trg_words = ["Ich", "kenne", "das"]

def match_indexes(word1, word2):
    return random.random()  # adjust this to get the actual correlation value

all_pairs_vals = []  # list of all source (src) and target (trg) index pairs with their correlation values
for i in range(len(src_words)):  # iterate over src indexes
    src_word = src_words[i]  # identify the corresponding src word
    for j in range(len(trg_words)):  # iterate over trg indexes
        trg_word = trg_words[j]  # identify the corresponding trg word
        val = match_indexes(src_word, trg_word)  # get the matching value from the inverted indexes of each word (or from the data provided in the spreadsheet)
        all_pairs_vals.append((i, j, val))  # add the sentence indexes for src and trg, and the corresponding val

all_pairs_vals.sort(key=lambda x: -x[-1])  # sort in descending order, so the pairs with the highest correlation come first
selected_alignments = []
used_i, used_j = [], []  # track the already-used row (src) and column (trg) indexes
for i0, j0, val0 in all_pairs_vals:
    if i0 in used_i: continue  # if the current src index i0 has been used before, skip this pair
    if j0 in used_j: continue  # same if the current trg index was used before
    selected_alignments.append((i0, j0))  # otherwise, add the current pair to the final alignment selection
    used_i.append(i0)  # and mark both indexes as used so they will not be selected again
    used_j.append(j0)

for a in all_pairs_vals:  # list all pairs and indicate which ones were selected
    i0, j0, val0 = a
    if (i0, j0) in selected_alignments:
        print(a, "<<<<")
    else:
        print(a)

It's problematic because it doesn't accommodate many-to-many, or even one-to-many, alignments, and it can err easily at the start by selecting a wrong pair with the highest correlation, excluding its row and column from future selection. A good algorithm would factor in that a certain pair has the highest correlation in its respective row/column, but would also consider the proximity to other pairs with high correlations.
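One way around the greedy pitfall (a sketch of an alternative, not a method from the question) is to treat the selection as an assignment problem and solve it globally with the Hungarian algorithm via scipy.optimize.linear_sum_assignment, then re-admit extra high-correlation links in a second pass to allow one-to-many alignments. The 0.8 threshold below is a hypothetical parameter:

import numpy as np
from scipy.optimize import linear_sum_assignment

def align(src_words, trg_words, match_indexes, extra_link_threshold=0.8):
    # build the correlation matrix (rows = src words, columns = trg words)
    M = np.array([[match_indexes(s, t) for t in trg_words] for s in src_words])
    # Hungarian algorithm: maximize the total correlation over one-to-one pairs,
    # so one early bad cell cannot block the rest of its row/column
    rows, cols = linear_sum_assignment(-M)  # negate because scipy minimizes
    alignment = set(zip(rows.tolist(), cols.tolist()))
    # second pass: also keep any pair whose correlation is high and is a row
    # or column maximum, which re-admits one-to-many links
    for i in range(M.shape[0]):
        for j in range(M.shape[1]):
            if M[i, j] >= extra_link_threshold and (
                M[i, j] == M[i].max() or M[i, j] == M[:, j].max()
            ):
                alignment.add((i, j))
    return sorted(alignment)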

Here is some data to try if you like, it's in Google sheets: https://docs.google.com/spreadsheets/d/1-eO47RH6SLwtYxnYygow1mvbqwMWVqSoAhW64aZrubo/edit?usp=sharing

Upvotes: 8

Views: 5258

Answers (4)

Chiarcos

Reputation: 354

As the question specifically addresses Python implementations, and Giza++ and FastAlign still seem to represent the SOTA, one might look into the following.

Most research code on the topic nowadays comes in Python and is based on embeddings, e.g., https://github.com/cisnlp/simalign, https://github.com/neulab/awesome-align, etc. However, the jury is still out on whether they outperform the older models and, if so, for which applications. In the end, you need to strike a compromise between context awareness (reordering!), precision, recall, and runtime. Neural models have great potential for being more context-aware; statistical models have more predictable behavior.
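For illustration, a minimal SimAlign usage sketch, following the example in its README at the time of writing (check the repo for the current API):

from simalign import SentenceAligner

# mBERT embeddings, BPE tokens; "mai" selects the three matching methods
# (mwmf = max-weight matching, inter = argmax intersection, itermax)
myaligner = SentenceAligner(model="bert", token_type="bpe", matching_methods="mai")

src_sentence = ["I", "know", "this"]
trg_sentence = ["Ich", "kenne", "das"]

# returns a dict mapping each matching method to a list of (src, trg) index pairs
alignments = myaligner.get_word_aligns(src_sentence, trg_sentence)
for method, pairs in alignments.items():
    print(method, ":", pairs)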

Upvotes: 1

MZe

Reputation: 168

I highly recommend testing Awesome-Align. It relies on multilingual BERT (mBERT), and the results look very promising. I even tested it with Arabic, and it did a great job on a difficult alignment example; Arabic is a morphology-rich language, which I believe makes it more challenging than a Latin-script language such as German.

[Image: Awesome-Align output aligning an Arabic sentence with its English translation]

As you can see, one word in Arabic corresponds to multiple words in English, and yet Awesome-Align managed to handle the many-to-many mapping to a great extent. You may give it a try and I believe it will meet your needs.

There is also a Google Colab demo at https://colab.research.google.com/drive/1205ubqebM0OsZa1nRgbGJBtitgHqIVv6?usp=sharing#scrollTo=smW6s5JJflCN
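If you want to try it outside the Colab, Awesome-Align reads one sentence pair per line, with the tokenized source and target separated by " ||| ". Here is a small sketch for preparing that input file (the CLI invocation in the comment is indicative only; check the repo's README for the exact flags):

# write sentence pairs in the "src tokens ||| trg tokens" format the tool expects
pairs = [
    (["I", "know", "this"], ["Ich", "kenne", "das"]),
]
with open("data.src-trg", "w", encoding="utf-8") as f:
    for src, trg in pairs:
        f.write(" ".join(src) + " ||| " + " ".join(trg) + "\n")

# the aligner is then run from the command line, roughly:
#   awesome-align --model_name_or_path bert-base-multilingual-cased \
#                 --data_file data.src-trg --output_file out.align --extraction softmax
# each output line contains the alignment as "i-j" token-index pairs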

Good luck!

Upvotes: 6

Jindřich

Reputation: 11240

Recently, there have also been two papers using bi-/multilingual word/contextual embeddings to do the word alignment. Both of them construct a bipartite graph in which the edges between words are weighted with their embedding distances, and use graph algorithms to get the alignment.

One paper does a maximum matching between the two parts of the graph. Because the matching is not symmetrical, they do it from both sides and use symmetrization heuristics similar to FastAlign's.

The other one mentions the alignment only briefly; it computes a minimum-weight edge cover on the graph and uses that as the alignment.

Both of them claim to be better than FastAlign.
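As a rough illustration of the bipartite idea (a sketch assuming you already have an embedding vector per word, not the papers' exact method):

import numpy as np
from networkx import Graph
from networkx.algorithms.bipartite import minimum_weight_full_matching

def embedding_matching(src_vecs, trg_vecs):
    # src_vecs, trg_vecs: lists of 1-D numpy arrays, one embedding per word
    G = Graph()
    for i, s in enumerate(src_vecs):
        for j, t in enumerate(trg_vecs):
            # cosine distance between the two embeddings as the edge weight
            dist = 1.0 - float(s @ t) / (np.linalg.norm(s) * np.linalg.norm(t))
            G.add_edge(("s", i), ("t", j), weight=dist)
    left = [("s", i) for i in range(len(src_vecs))]
    # minimum-weight matching that covers every word on the smaller side;
    # the returned dict maps each node to its partner (in both directions)
    match = minimum_weight_full_matching(G, top_nodes=left)
    return sorted((u[1], v[1]) for u, v in match.items() if u[0] == "s")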

Upvotes: 4

ales_t

Reputation: 2017

Word alignment remains an open research topic to some extent. The probabilistic models behind Giza++ are fairly non-trivial, see: http://www.ee.columbia.edu/~sfchang/course/svia/papers/brown-machine-translate-93.pdf
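To give a flavour of those models, here is a toy EM sketch of IBM Model 1, the simplest model in that paper (no NULL word, no smoothing; Giza++ layers fertility, distortion, and more on top):

from collections import defaultdict

def ibm_model1(corpus, iterations=10):
    # corpus: list of (src_tokens, trg_tokens) sentence pairs
    t = defaultdict(lambda: 1.0)  # translation probs t[(trg, src)], uniform start
    for _ in range(iterations):
        count = defaultdict(float)  # expected co-occurrence counts (E-step)
        total = defaultdict(float)
        for src, trg in corpus:
            for f in trg:
                norm = sum(t[(f, e)] for e in src)
                for e in src:
                    c = t[(f, e)] / norm
                    count[(f, e)] += c
                    total[e] += c
        for (f, e), c in count.items():  # re-estimate the probabilities (M-step)
            t[(f, e)] = c / total[e]
    return t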

There are a lot of existing approaches you could take, such as:

This is a very difficult machine learning problem, and while it's not impossible that simple approaches such as yours could work, it might be a good idea to study the existing work first. That being said, we have seen quite a few breakthroughs from surprisingly simple techniques in this field, so who knows :-)

Upvotes: 6
