Reputation: 731
I want to generate IDs for strings that are being read from a text file. If the strings are duplicates, I want the first instance of the string to have an ID containing 6 characters. For the duplicates of that string, I want the ID to be the same as the original one, but with an additional two characters. I'm having trouble with the logic. Here's what I've done so far:
from itertools import groupby
import uuid

f = open('test.txt', 'r')
addresses = f.readlines()
list_of_addresses = ['Address']
list_of_ids = ['ID']
for x in addresses:
    list_of_addresses.append(x)

def find_duplicates():
    for x, y in groupby(sorted(list_of_addresses)):
        id = str(uuid.uuid4().get_hex().upper()[0:6])
        j = len(list(y))
        if j > 1:
            print str(j) + " instances of " + x
            list_of_ids.append(id)
            print list_of_ids

find_duplicates()
How should I approach this?
Edit: here are the contents of test.txt:
123 Test
123 Test
123 Test
321 Test
567 Test
567 Test
And the output:
3 occurences of 123 Test
['ID', 'C10DD8']
['ID', 'C10DD8']
2 occurences of 567 Test
['ID', 'C10DD8', '595C5E']
['ID', 'C10DD8', '595C5E']
Upvotes: 1
Views: 1663
Reputation: 44485
If the strings are duplicates, I want the first instance of the string to have an ID containing 6 characters. For the duplicates of that string, I want the ID to be the same as the original one, but with an additional two characters.
Try using a collections.defaultdict.
Given
import ctypes
import collections as ct

filename = "test.txt"

def read_file(fname):
    """Read lines from a file."""
    with open(fname, "r") as f:
        for line in f:
            yield line.strip()
Code
dd = ct.defaultdict(list)
for x in read_file(filename):
    key = str(ctypes.c_size_t(hash(x)).value)  # make positive hashes
    if key[:6] not in dd:
        dd[key[:6]].append(x)
    else:
        dd[key[:8]].append(x)

dd
Output
defaultdict(list,
            {'133259': ['123 Test'],
             '13325942': ['123 Test', '123 Test'],
             '210763': ['567 Test'],
             '21076377': ['567 Test'],
             '240895': ['321 Test']})
The resulting dictionary has a 6-character key for the first occurrence of each unique line. Every subsequent duplicate of that line is stored under an 8-character key, made by slicing two more characters from the same hash, so it starts with the original six.
You can implement the keys however you wish. In this case, we used hash() to correlate the key to each unique line, then sliced the desired number of characters from it. See also a post on making positive hash values from ctypes.
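One caveat with hash(): in Python 3, string hashes are randomized per process unless PYTHONHASHSEED is set, so the keys shown above will differ between runs. If reproducible IDs matter, a digest from hashlib is one alternative. The following is only a minimal sketch under that assumption; stable_key and assign_ids are hypothetical names, and SHA-1 is swapped in for hash(), which is not what the code above does. It also tracks seen lines in a set instead of testing the truncated key, so two different lines that happen to share a 6-character prefix are not mistaken for duplicates.

```python
import hashlib


def stable_key(line, length=6):
    """Return the first `length` hex characters of the line's SHA-1 digest."""
    digest = hashlib.sha1(line.encode("utf-8")).hexdigest().upper()
    return digest[:length]


def assign_ids(lines):
    """Pair each line with a 6-char ID on first sight, an 8-char ID afterwards."""
    seen = set()      # lines already encountered (hypothetical helper state)
    result = []
    for line in lines:
        if line in seen:
            # Duplicate: same digest, two more characters
            result.append((line, stable_key(line, 8)))
        else:
            seen.add(line)
            result.append((line, stable_key(line, 6)))
    return result


pairs = assign_ids(["123 Test", "123 Test", "321 Test"])
for line, id_ in pairs:
    print(id_, line)
```

Because the 8-character ID is a longer slice of the same digest, it always begins with the 6-character ID of the first occurrence. All duplicates of a line share one 8-character ID here, as in the defaultdict version.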
To inspect your results, create the appropriate lookup dictionaries from the defaultdict.
# Lookups
occurrences = ct.defaultdict(int)
ids = ct.defaultdict(list)
for k, v in dd.items():
    key = v[0]
    occurrences[key] += len(v)
    ids[key].append(k)

# View data
for k, v in occurrences.items():
    print("{} instances of {}".format(v, k))
    print("IDs:", ids[k])
    print()
Output
1 instances of 321 Test
IDs: ['240895']
2 instances of 567 Test
IDs: ['21076377', '210763']
3 instances of 123 Test
IDs: ['13325942', '133259']
Upvotes: 1
Reputation: 12669
Your question is a little confusing; I don't understand what the criteria for generating the ID are. Here I am showing only the logic, not an exact solution, which you can adapt to your needs:
track = {}
with open('file.txt') as f:
    for line_no, line in enumerate(f):
        # Group lines by their first whitespace-separated token
        key = line.split()[0]
        if key not in track:
            track[key] = [['ID', 'your_unique_id']]
        else:
            # Put your own logic here for what to append when the ID is a duplicate
            track[key].append(['ID', 'dublicate_id' + str(line_no)])
print(track)
output:
{'123': [['ID', 'your_unique_id'], ['ID', 'dublicate_id1'], ['ID', 'dublicate_id2']], '321': [['ID', 'your_unique_id']], '567': [['ID', 'your_unique_id'], ['ID', 'dublicate_id5']]}
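To make the placeholders concrete, here is one possible way to fill them in. It is a sketch under stated assumptions, not the exact solution: the function name assign_ids is hypothetical, the 6-character base ID comes from uuid.uuid4().hex (the Python 3 counterpart of the get_hex() call in the question), and the two extra characters are assumed to be a zero-padded duplicate counter. Unlike the snippet above, it keys on the whole stripped line rather than the first token.

```python
import uuid


def assign_ids(lines):
    """Return (line, id) pairs: 6-char ID first, base ID + 2-digit counter after."""
    base_ids = {}    # line -> 6-character ID of its first occurrence
    dup_counts = {}  # line -> number of duplicates seen so far
    rows = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line not in base_ids:
            # First occurrence: random 6-character uppercase hex ID
            base_ids[line] = uuid.uuid4().hex[:6].upper()
            dup_counts[line] = 0
            rows.append((line, base_ids[line]))
        else:
            # Duplicate: original ID plus a two-character counter (01, 02, ...)
            dup_counts[line] += 1
            suffix = "{:02d}".format(dup_counts[line])
            rows.append((line, base_ids[line] + suffix))
    return rows


sample = ["123 Test\n", "123 Test\n", "123 Test\n",
          "321 Test\n", "567 Test\n", "567 Test"]
for line, id_ in assign_ids(sample):
    print(id_, line)
```

The two-digit counter keeps IDs at 8 characters only for up to 99 duplicates of a line, and random 6-character IDs are not guaranteed to be unique across different lines, so a collision check may be worth adding for large files.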
Upvotes: 0