Reputation: 30492

How to find all words that have the same appearence in two different languages?

The russian alphabet includes many letters that are the same in the English alphabet. Here is the list of common letters: L='acekopuxy'

Now, given two huge lists R and E, each in the form [word_A, word_B, ...], where each word_N is a lowercase word, I want to create a list C, which should contain only those words that have the same spelling in E and in R. For example, a word 'cop' must be in C, because it is in the list R as well as in E.

Is there any polynomial way to do it?

P.S.: One important note: because of the different character encodings, there are two L lists, LE for English letters and LR for Russian, but the appearance of their letters is the same:

LE='acekopuxy'
LR='асекориху'

Upvotes: 0

Answers (4)

Fred Foo

Reputation: 363597

Eset = set(E)
C = [w for w in R if w.replace(LR,LE) in Eset]

Not sure if I understood the problem correctly, but assuming good hashing, this runs in O(n).

Upvotes: 1

Tim Pietzcker

Reputation: 336178

You can use regular expressions for this:

^[acekopuxy]+$

will match words that contain only those characters.

import re
regex = re.compile(r"^[acekopuxy]+$", re.I)
output = []
for word in mylist:
    if regex.match(word):
        output.append(word)

You'll need to do this for both lists, using the correct encodings. That means that for the Russian list, you'll need to use the equivalent characters, like ^[\u0441\u1234...]$.

Then, if you want to find the words that "look the same", you could use a translation table to convert the words in one of the list into the format of the other list, then convert the lists to sets, and check their intersection.

Upvotes: 1

Luper Rouch

Reputation: 9492

You can use sets for this:

english_set = set(E)
russian_set = set(R)
common_words = english_set.intersection(russian_set)

I'm not sure I got the encoding part right though, but if that means letters that look similar are actually different bytes, you can for example prepare the russian list by replacing these letters by their english counterpart prior to doing the intersection.

Upvotes: 3

Boldewyn

Reputation: 82734

You need to tell the program yourself, which characters are similar. Since they are each different Unicode codepoints, you will have to have a mapping like this:

var RE_map = (
  (u'c', u'\u0441'),
  # ...and so on
)

Then, translate all words from R to their E representation:

for ec, rc in RE_map:
    string = string.replace(rc, ec)

and finally check, if the string is now in E:

if string in E:
    print "The word exists of characters similar in Latin and Cyrillic."

Upvotes: 1

How to find all words that have the same appearence in two different languages?

Answers (4)

Related Questions