zzzbbx

Reputation: 10161

Assign strings to IDs in Python

I am reading a text file with Python in which the values in each column may be numeric or strings.

When those values are strings, I need to assign a unique ID to each string (unique across all the strings under the same column; the same ID must be assigned if the same string appears elsewhere under the same column).

What would be an efficient way to do it?

Upvotes: 4

Views: 1684

Answers (3)

Greg Allen

Reputation: 493

The defaultdict answer, updated for Python 3, where .next is now .__next__, and for pylint compliance, where calling "magic" __*__ methods directly is discouraged:

ids = collections.defaultdict(functools.partial(next, itertools.count()))
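
A minimal sketch of how that one-liner behaves on Python 3, with the imports it needs. The keys 'a' and 'b' and the variable names first, second and repeat are only illustrative:

```python
import collections
import functools
import itertools

# Each lookup of a missing key calls next() on the shared counter,
# so new keys receive 0, 1, 2, ... in the order they are first seen.
ids = collections.defaultdict(functools.partial(next, itertools.count()))

first = ids['a']    # new key, takes the next counter value
second = ids['b']   # another new key
repeat = ids['a']   # already stored, so the counter is not advanced

print(first, second, repeat)
```

Because the value is stored on first lookup, a repeated string always gets back the ID it was given the first time.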

Upvotes: 2

user2357112

Reputation: 282138

Use a defaultdict with a default value factory that generates new ids:

ids = collections.defaultdict(itertools.count().next)  # Python 2; use .__next__ on Python 3
ids['a']  # 0
ids['b']  # 1
ids['a']  # 0

When you look up a key in a defaultdict, if it's not already present, the defaultdict calls a user-provided default value factory to get the value and stores it before returning it.

itertools.count() creates an iterator that counts up from 0, so itertools.count().next (.__next__ on Python 3) is a bound method that produces a new integer whenever you call it.

Combined, these tools produce a dict that returns a new integer whenever you look up something you've never looked up before.
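
To get IDs that are unique per column, as the question asks, one option is to keep a separate counting dict for each column. This is only a sketch for Python 3: the helper names new_id_map, is_number and encode_rows are made up for illustration, and it assumes that anything float() accepts counts as numeric.

```python
import collections
import itertools


def new_id_map():
    """Return a dict that hands out 0, 1, 2, ... to previously unseen keys."""
    return collections.defaultdict(itertools.count().__next__)


def is_number(value):
    """Assumption: a cell is numeric if float() can parse it."""
    try:
        float(value)
    except ValueError:
        return False
    return True


def encode_rows(rows):
    """Replace string cells with per-column integer IDs; keep numeric cells."""
    column_ids = collections.defaultdict(new_id_map)  # column index -> id map
    encoded = []
    for row in rows:
        encoded.append([
            value if is_number(value) else column_ids[col][value]
            for col, value in enumerate(row)
        ])
    return encoded, column_ids


# Hypothetical sample data standing in for rows read from the file.
rows = [["1", "red", "x"], ["2", "blue", "x"], ["3", "red", "y"]]
encoded, column_ids = encode_rows(rows)
print(encoded)
```

Each column gets its own counter, so the same string can map to different IDs in different columns while staying stable within one column.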

Upvotes: 12

Burhan Khalid

Reputation: 174758

Create a set, and then add strings to the set. This will ensure that strings are not duplicated; then you can use enumerate to get a unique ID for each string. Use this ID when you are writing the file out again.

Here I am assuming the second column is the one you want to scan for text or integers.

import csv

seen = set()
with open('somefile.txt') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        try:
            int(row[1])
        except ValueError:
            seen.add(row[1])  # not an integer, so record the string

# print the unique ids for each string
for string_id, text in enumerate(seen):
    print("{}: {}".format(string_id, text))

Now you can take the same logic and replicate it across each column of your file. If you know the number of columns in advance, you can keep a list of sets, one per column. Suppose the file has three columns:

import csv

unique_strings = [set(), set(), set()]

with open('file.txt') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        for column, value in enumerate(row):
            try:
                int(value)
            except ValueError:
                # It is not an integer, so it must be
                # a string
                unique_strings[column].add(value)
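
Once the sets are filled, they still need to be turned into string-to-ID lookups. A possible sketch follows; the sample contents of unique_strings and the name string_ids are hypothetical, and sorting is an added choice so the numbering is the same on every run, since set iteration order is not guaranteed.

```python
# Hypothetical stand-in for the list of sets built while scanning the file.
unique_strings = [set(), {"red", "blue"}, {"x", "y"}]

# One dict per column, mapping each string to its ID within that column.
string_ids = [
    {text: index for index, text in enumerate(sorted(column))}
    for column in unique_strings
]

print(string_ids[1]["red"])
```

With these dicts in hand, a second pass over the file can replace each string with string_ids[column][value] when writing it out again.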

Upvotes: 0
