Reputation: 77
I am searching for matches in large text file, but I find it way too slow. This is the file structure:
word1 5752
word2 96332
word3 137
I am trying to match text in first column, and I want to extract the value in second column. The columns are separated by \t and there are about 10 million lines. The file is searched many times with different words. What method of searching has the best time efficiency?
EDIT: The file is 129 Mb and will be searched thousands of times at least. EDIT2: File is alphabetically sorted and words can occur multiple times only if they have different capital letters e.g: Word WORD word WOrd will all be different entries.
Upvotes: 0
Views: 3035
Reputation: 21
If you store your data in a Hash Table (a python dictionary structure) it would be very quick to do this operation. Your 'Key' is the name, each Key has a 'Value', the number. This code shown below utilizes the hash for a faster data retrieval:
yourDict = {'name0':number0,'name1':number1,...,'nameN':numberN}
if 'checkName' in yourDict:
#It exists!
theNumber = yourDict['checkName']
else:
#It doesn't exist :/
*Note: if you use:
if 'checkName' in yourDict.keys():
you are actually creating a list of keys and then searching through them. This operation doesn't utilize the Hash Table (much slower).
This is a bit on how HandTable Data Structures work: https://www.youtube.com/watch?v=MfhjkfocRR0
This is an answer showing that a dictionary in python acts like a Hash Table: Is a Python dictionary an example of a hash table?
Upvotes: 2
Reputation: 1367
Is this for an assignment or for work/project? I don't know how people feel about re-implementing core algorithms, but you how big is your text file?
An alternative approach using Pandas for ease of use and underlying optimization:
In [61]: df = pd.read_csv(r'C:\temp\data.txt', header=None, sep=' ')
In [62]: df
Out[62]:
0 1
0 word1 5752
1 word2 96332
2 word3 137
In [63]: df[df[0] == 'word2']
Out[63]:
0 1
1 word2 96332
In [64]: df[df[0] == 'word2'][1]
Out[64]:
1 96332
Name: 1, dtype: int64
2 questions for you:
1) Can this be held in memory instead of being re-loaded every time? (maybe with a TTL of like an hour?)
2) Is your file sorted? I believe Binary Search needs to have sorted data first. What would the impact to performance be to sort every time you have to read the data?
Upvotes: 1
Reputation: 185
with open('myfile.dat','r') as src:
mapping = dict((line.strip().split('\t') for line in src if line))
Depending on the size of the file and memory, this could be a solution. If you have to perform this kind of search algorithm more than once during a run of your program.
Upvotes: 2
Reputation: 837
I would first sort the file alphabetically and then perform a logarithmic search (https://en.wikipedia.org/wiki/Binary_search_algorithm). You have a nice example on how to do it with python here: http://programarcadegames.com/index.php?chapter=searching&lang=en#section_16.5
Upvotes: 0