Reputation: 30492
The russian alphabet includes many letters that are the same in the English alphabet. Here is the list of common letters: L='acekopuxy'
Now, given two huge lists R and E, each in the form [word_A, word_B, ...], where each word_N is a lowercase word, I want to create a list C, which should contain only those words that have the same spelling in E and in R. For example, a word 'cop' must be in C, because it is in the list R as well as in E.
Is there any polynomial way to do it?
P.S.: One important note: because of the different character encodings, there are two L lists, LE for English letters and LR for Russian, but the appearance of their letters is the same:
LE='acekopuxy'
LR='асекориху'
Upvotes: 0
Views: 380
Reputation: 363597
Eset = set(E)
C = [w for w in R if w.replace(LR,LE) in Eset]
Not sure if I understood the problem correctly, but assuming good hashing, this runs in O(n).
Upvotes: 1
Reputation: 336178
You can use regular expressions for this:
^[acekopuxy]+$
will match words that contain only those characters.
import re
regex = re.compile(r"^[acekopuxy]+$", re.I)
output = []
for word in mylist:
if regex.match(word):
output.append(word)
You'll need to do this for both lists, using the correct encodings. That means that for the Russian list, you'll need to use the equivalent characters, like ^[\u0441\u1234...]$
.
Then, if you want to find the words that "look the same", you could use a translation table to convert the words in one of the list into the format of the other list, then convert the lists to sets, and check their intersection.
Upvotes: 1
Reputation: 9492
You can use sets for this:
english_set = set(E)
russian_set = set(R)
common_words = english_set.intersection(russian_set)
I'm not sure I got the encoding part right though, but if that means letters that look similar are actually different bytes, you can for example prepare the russian list by replacing these letters by their english counterpart prior to doing the intersection.
Upvotes: 3
Reputation: 82734
You need to tell the program yourself, which characters are similar. Since they are each different Unicode codepoints, you will have to have a mapping like this:
var RE_map = (
(u'c', u'\u0441'),
# ...and so on
)
Then, translate all words from R to their E representation:
for ec, rc in RE_map:
string = string.replace(rc, ec)
and finally check, if the string is now in E
:
if string in E:
print "The word exists of characters similar in Latin and Cyrillic."
Upvotes: 1