Produce unique ids for a list of lists with words

Question

I have a list of lists containing pairs of words and want to map the words to ids. The ids should run from 0 up to (but not including) len(set(words)). The list currently looks like this:

[['pluripotent', 'Scharte'],
 ['Halswirbel', 'präventiv'],
 ['Kleiber', 'Blauspecht'],
 ['Kleiber', 'Scheidung'],
 ['Nillenlutscher', 'Salzstangenlecker']]

The result should have the same format, but with ids instead of words. For example:

[[0, 1],
 [2, 3],
 [4, 5],
 [4, 6],
 [7, 8]]

This is what I have so far, but it doesn't give me the right output:

def words_to_ids(labels):
  vocabulary = []
  word_to_id = {}
  ids = []
  for word1,word2 in labels:
      vocabulary.append(word1)
      vocabulary.append(word2)

  for i, word in enumerate(vocabulary):
      word_to_id [word] = i
  for word1,word2 in labels:
      ids.append([word_to_id [word1], word_to_id [word1]])
  print(ids)

Output:

[[0, 0], [2, 2], [6, 6], [6, 6], [8, 8]]

It repeats ids even for words that are unique.

Martijn Pieters · Accepted Answer

You have two errors. First, you have a simple typo, here:

for word1,word2 in labels:
    ids.append([word_to_id [word1], word_to_id [word1]])

You are adding the id for word1 twice, there. Correct the second word1 to look up word2 instead.

Next, you are not testing whether you have seen a word before, so for 'Kleiber' you first give it the id 4, then overwrite that entry with 6 when the word comes up again. You need to number only the unique words, not every occurrence:

counter = 0
for word in vocabulary:
    if word not in word_to_id:
        word_to_id[word] = counter
        counter += 1

Alternatively, you could simply not add a word to vocabulary if it is already listed. You don't really need a separate vocabulary list here, by the way; a separate loop doesn't buy you anything, so the following works too:

word_to_id = {}
counter = 0
for words in labels:
    for word in words:
        # only assign an id the first time a word is seen
        if word not in word_to_id:
            word_to_id[word] = counter
            counter += 1
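
Putting both fixes together, a minimal sketch of the corrected function could look like the following. It keeps the name words_to_ids from the question, but returns the result rather than printing it, which is an assumption on my part; the sample data is taken from the question.

```python
def words_to_ids(labels):
    """Map each distinct word to an integer id, in order of first appearance."""
    word_to_id = {}
    counter = 0
    ids = []
    for word1, word2 in labels:
        for word in (word1, word2):
            # Hand out the next id only the first time a word is seen.
            if word not in word_to_id:
                word_to_id[word] = counter
                counter += 1
        # Look up word1 *and* word2 (the original looked up word1 twice).
        ids.append([word_to_id[word1], word_to_id[word2]])
    return ids


labels = [
    ['pluripotent', 'Scharte'],
    ['Halswirbel', 'präventiv'],
    ['Kleiber', 'Blauspecht'],
    ['Kleiber', 'Scheidung'],
    ['Nillenlutscher', 'Salzstangenlecker'],
]
print(words_to_ids(labels))
# [[0, 1], [2, 3], [4, 5], [4, 6], [7, 8]]
```

Both occurrences of 'Kleiber' now get the id 4, and the ids run from 0 to 8 without gaps.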

You can simplify your code quite a bit by using a defaultdict object and itertools.count() to supply default values:

from collections import defaultdict
from itertools import count

def words_to_ids(labels):
    word_ids = defaultdict(count().__next__)
    return [[word_ids[w1], word_ids[w2]] for w1, w2 in labels]

The count() object gives you the next integer value in a series each time __next__ is called, and defaultdict() will call that each time you try to access a key that doesn't yet exist in the dictionary. Together, they ensure a unique ID for each unique word.
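
To see how the two pieces interact, here is a small illustrative session (the variable names counter and word_ids are just for this demo, and it assumes Python 3, where iterators have a __next__ method):

```python
from collections import defaultdict
from itertools import count

counter = count()
print(counter.__next__())  # 0
print(next(counter))       # 1, next() calls __next__ for you

# Each missing key triggers one call to the factory, i.e. one new integer.
word_ids = defaultdict(count().__next__)
print(word_ids['Kleiber'])     # 0, new key
print(word_ids['Blauspecht'])  # 1, new key
print(word_ids['Kleiber'])     # 0, existing key, factory not called
print(dict(word_ids))          # {'Kleiber': 0, 'Blauspecht': 1}

# A membership test does not create an entry, so it is safe for checks.
print('Scheidung' in word_ids)  # False
print(len(word_ids))            # 2
```

One thing to keep in mind: any plain word_ids[key] lookup on a missing key assigns it an id, so use in or .get() when you only want to check whether a word is known.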
