Reputation: 2087
I am working on a project where I'm reading 250000 items or more into a list and converting each of its entries into a key in a hash table.
sample_key = open("sample_file.txt").readlines()
sample_counter = [0] * (len(sample_key))
sample_hash = {sample.replace('\n', ''):counter for sample, counter in zip(sample_key, sample_counter)}
This code works well when len(sample_key) is in the range 1000-2000. Beyond that, it simply ignores any further data.
Any suggestions on how I can handle this much list data?
PS: Also, if there is a more optimal way to perform this task (like reading each line directly as a hash key entry), please suggest it. I'm new to Python.
Upvotes: 3
Views: 5944
Reputation: 109510
Your text file can contain duplicate lines, which would overwrite existing keys in your dictionary (the Python name for a hash table). You can build a unique set of your keys first, and then use a dictionary comprehension to populate the dictionary.
sample_file.txt
a
b
c
c
Python code
with open("sample_file.txt") as f:
keys = set(line.strip() for line in f.readlines())
my_dict = {key: 1 for key in keys if key}
>>> my_dict
{'a': 1, 'b': 1, 'c': 1}
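As a side note, the same dictionary can also be built with dict.fromkeys, which maps every key to a single initial value; a minimal sketch, using 0 as the starting counter value from the question:
with open("sample_file.txt") as f:
    keys = {line.strip() for line in f if line.strip()}

# dict.fromkeys maps every key in the set to the same initial value (0 here).
zero_dict = dict.fromkeys(keys, 0)
# {'a': 0, 'b': 0, 'c': 0}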
Here is a demonstration with 1 million random 10-character lowercase strings. The timing is fairly trivial at under half a second.
import string
import numpy as np
letter_map = {n: letter for n, letter in enumerate(string.ascii_lowercase, 1)}
# Generate 1,000,000 random 10-letter strings.
# (np.random.random_integers is deprecated in newer NumPy; see the randint sketch below.)
long_alpha_list = ["".join([letter_map[number] for number in row]) + "\n"
                   for row in np.random.random_integers(1, 26, (1000000, 10))]
>>> long_alpha_list[:5]
['mfeeidurfc\n',
'njbfzpunzi\n',
'yrazcjnegf\n',
'wpuxpaqhhs\n',
'fpncybprrn\n']
>>> len(long_alpha_list)
1000000
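If you are on a newer NumPy where random_integers has been removed, the same test data can be generated with np.random.randint, whose upper bound is exclusive; a minimal sketch along the lines of the code above:
# Equivalent generation on newer NumPy: randint's upper bound is exclusive,
# so 27 covers the letters numbered 1..26.
long_alpha_list = ["".join(letter_map[number] for number in row) + "\n"
                   for row in np.random.randint(1, 27, (1000000, 10))]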
# Write list to file.
with open('sample_file.txt', 'w') as f:   # text mode, since the lines are str
    f.writelines(long_alpha_list)
# Read them back into a dictionary per the method above.
with open("sample_file.txt") as f:
keys = set(line.strip() for line in f.readlines())
>>> %%timeit -n 10
>>> my_dict = {key: 1 for key in keys if key}
10 loops, best of 3: 379 ms per loop
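If the zero counters in the original question are meant to be incremented later, i.e. the real goal is to count how many times each line occurs, collections.Counter does that in a single pass over the file; a sketch under that assumption:
from collections import Counter

# Count occurrences of each stripped, non-empty line in one pass.
with open("sample_file.txt") as f:
    line_counts = Counter(line.strip() for line in f if line.strip())

# Counter is a dict subclass, so line_counts['a'] gives the count for 'a'.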
Upvotes: 6