Reputation: 169
I have a list of lists with pairs of words and want to map the words to ids. Ids should run from 0 to len(set(words)) - 1. The list currently looks like this:
[['pluripotent', 'Scharte'],
['Halswirbel', 'präventiv'],
['Kleiber', 'Blauspecht'],
['Kleiber', 'Scheidung'],
['Nillenlutscher', 'Salzstangenlecker']]
The result should have the same format, but with ids instead of words. For example:
[[0, 1],
[2, 3],
[4, 5],
[4, 6],
[7, 8]]
This is what I have so far, but it doesn't give me the right output:
def words_to_ids(labels):
    vocabulary = []
    word_to_id = {}
    ids = []
    for word1,word2 in labels:
        vocabulary.append(word1)
        vocabulary.append(word2)
    for i, word in enumerate(vocabulary):
        word_to_id [word] = i
    for word1,word2 in labels:
        ids.append([word_to_id [word1], word_to_id [word1]])
    print(ids)
Output:
[[0, 0], [2, 2], [6, 6], [6, 6], [8, 8]]
It repeats ids even where the words are unique.
Upvotes: 1
Views: 615
Reputation: 1121834
You have two errors. First, you have a simple typo here:
    for word1,word2 in labels:
        ids.append([word_to_id [word1], word_to_id [word1]])
You are adding the id for word1 twice there. Correct the second word1 to look up word2 instead.
Next, you are not testing whether you have seen a word before, so for 'Kleiber' you first give it the id 4, then overwrite that entry with 6 on a later iteration. You need to number the unique words, not all words:
counter = 0
for word in vocabulary:
    if word not in word_to_id:
        word_to_id[word] = counter
        counter += 1
or you could simply not add a word to vocabulary if you already have that word listed. You don't really need a separate vocabulary list here, by the way. A separate loop doesn't buy you anything, so the following works too:
word_to_id = {}
counter = 0
for words in labels:
    for word in words:
        # skip words that already have an id
        if word not in word_to_id:
            word_to_id[word] = counter
            counter += 1
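Putting both fixes together, a minimal sketch of the whole function could look like the following. The sample data is the list from the question; returning the result instead of printing it is my own choice here, not something the question asks for.

```python
def words_to_ids(labels):
    """Map each unique word to an integer id, in order of first appearance."""
    word_to_id = {}
    counter = 0
    ids = []
    for word1, word2 in labels:
        for word in (word1, word2):
            # Only hand out a new id the first time a word is seen.
            if word not in word_to_id:
                word_to_id[word] = counter
                counter += 1
        # Look up word1 AND word2 (the original looked up word1 twice).
        ids.append([word_to_id[word1], word_to_id[word2]])
    return ids


labels = [
    ['pluripotent', 'Scharte'],
    ['Halswirbel', 'präventiv'],
    ['Kleiber', 'Blauspecht'],
    ['Kleiber', 'Scheidung'],
    ['Nillenlutscher', 'Salzstangenlecker'],
]
print(words_to_ids(labels))
# [[0, 1], [2, 3], [4, 5], [4, 6], [7, 8]]
```

'Kleiber' keeps the id 4 on its second appearance, so 'Scheidung' gets 6 and the ids stay contiguous.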
You can simplify your code quite a bit by using a defaultdict object and itertools.count() to supply default values:
from collections import defaultdict
from itertools import count

def words_to_ids(labels):
    word_ids = defaultdict(count().__next__)
    return [[word_ids[w1], word_ids[w2]] for w1, w2 in labels]
The count() object gives you the next integer value in a series each time __next__ is called, and defaultdict() will call that each time you try to access a key that doesn't yet exist in the dictionary. Together, they ensure a unique ID for each unique word.
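To see the mechanism in isolation, here is a small sketch that looks up a few keys by hand. The variable names first, second and again are only for illustration.

```python
from collections import defaultdict
from itertools import count

# Each missing key triggers one call to count().__next__, which yields 0, 1, 2, ...
word_ids = defaultdict(count().__next__)

first = word_ids['Kleiber']      # missing key: factory is called, 0 is stored
second = word_ids['Blauspecht']  # another missing key: 1 is stored
again = word_ids['Kleiber']      # existing key: stored value returned, factory not called

print(first, second, again)  # 0 1 0
print(dict(word_ids))        # {'Kleiber': 0, 'Blauspecht': 1}
```

Note that merely reading a missing key inserts it, so a lookup for a word that was never in the data would silently add a new id.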
Upvotes: 2
Reputation: 164673
There are two issues: first, you have a typo and look up word1 twice in word_to_id; second, when constructing your word_to_id dictionary you need to consider unique values only. For example, in Python 3.7+ you can take advantage of insertion-ordered dictionaries:
for i, word in enumerate(dict.fromkeys(vocabulary)):
    word_to_id[word] = i

for word1, word2 in labels:
    ids.append([word_to_id[word1], word_to_id[word2]])
An alternative for versions pre-3.7 is to use collections.OrderedDict or the itertools unique_everseen recipe. If there is no ordering requirement, you can just use set(vocabulary).
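As a rough sketch of the OrderedDict variant, assuming the same labels list as in the question and building vocabulary with a comprehension rather than the original loop:

```python
from collections import OrderedDict

labels = [
    ['pluripotent', 'Scharte'],
    ['Halswirbel', 'präventiv'],
    ['Kleiber', 'Blauspecht'],
    ['Kleiber', 'Scheidung'],
    ['Nillenlutscher', 'Salzstangenlecker'],
]

# Flatten the pairs into one list of words, duplicates included.
vocabulary = [word for pair in labels for word in pair]

# OrderedDict.fromkeys drops duplicates while keeping first-seen order,
# so enumerate() numbers each unique word exactly once.
word_to_id = {word: i for i, word in enumerate(OrderedDict.fromkeys(vocabulary))}

ids = [[word_to_id[w1], word_to_id[w2]] for w1, w2 in labels]
print(ids)  # [[0, 1], [2, 3], [4, 5], [4, 6], [7, 8]]
```

With set(vocabulary) in place of OrderedDict.fromkeys(vocabulary) the ids are still unique and run from 0 to 8, but which word gets which id is arbitrary, so the output will generally not match the example in the question.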
Upvotes: 1