steinunnfridriks
steinunnfridriks

Reputation: 3

Matching two-word variants with each other if they don't match alphabetically

I'm doing a NLP project with my university, collecting data on words in Icelandic that exist both spelled with an i and with a y (they sound the same in Icelandic fyi) where the variants are both actual words but do not mean the same thing. Examples of this would include leyti (an approximation in time) and leiti (a grassy hill), or kirkja (church) and kyrkja (choke). I have a dataset of 2 million words. I have already collected two wordlists, one of which includes words spelled with a y and one includes the same words spelled with a i (although they don't seem to match up completely, as the y-list is a bit longer, but that's a separate issue). My problem is that I want to end up with pairs of words like leyti - leiti, kyrkja - kirkja, etc. But, as y is much later in the alphabet than i, it's no good just sorting the lists and pairing them up that way. I also tried zipping the lists while checking the first few letters to see if I can find a match but that leaves out all words that have y or i as the first letter. Do you have a suggestion on how I might implement this?

Upvotes: 0

Views: 69

Answers (3)

steinunnfridriks
steinunnfridriks

Reputation: 3

So this accomplishes my task, kind of an easy not-that-pretty solution I suppose but it works:

wordlist = open("data.txt", "r", encoding='utf-8')
y_words = open("y_wordlist.txt", "w+", encoding='utf-8')
all_words = []
y_words = []

for word in wordlist:
    word = word.lower()
    all_words.append(word)

for word in all_words:
    if "y" in word:
        y_words.append(word)

word_dict = {}

for word in y_words:
    newwith1y = word.replace("y", "i",1)
    newwith2y = word.replace("y", "i",2)
    newyback = word[::-1].replace("y", "i",1)
    newyback = newyback[::-1]
    word_dict[word] = newwith1y
    word_dict[word] = newwith2y
    word_dict[word] = newyback

for key, value in word_dict.items():
    if value in all_words:
        y_wordlist.write(key)
        y_wordlist.write(" - ")
        y_wordlist.write(value)
        y_wordlist.write("\n")

Upvotes: 0

Amrith Krishna
Amrith Krishna

Reputation: 2853

I do not think this is a programming challenge, but looks more like an NLP challenge itself. The spelling variations are often a hurdle that one would face during preprocessing.

I would suggest that you use an Edit-distance based approaches to identify word pairs that allow for some variations. Specifically for the kind of problem that you have described above, I would recommend "Jaro Winkler Distance". This method allows to give higher similarity score between word pairs that shows variations between specific character pairs, say y and i.

All these approaches are implemented in Jellyfish library. You may also have a look at the fuzzywuzzy package as well. Hope this helps.

Upvotes: 0

DBaker
DBaker

Reputation: 2139

Try something like this:

s = "trydfydfgfay"
l = list(s)
candidateWords = []
for idx, c in enumerate(l):
    if c=='y':
        newList = l.copy()
        newList[idx] = "i"
        candidateWord = "".join(newList)
        candidateWords.append(candidateWord)
print(candidateWords)
#['tridfydfgfay', 'trydfidfgfay', 'trydfydfgfai']
#look up these words to see if they are real words  

Upvotes: 0

Related Questions