Reputation: 23
What are the "best" ways to search for an occurrence of a string in a large number of text files, using Python?
As I understand it, we can use the following:
for f in files:
    with open(f) as fh:
        for line in fh:
            ...  # do stuff
Python buffers the file in chunks under the hood, so the I/O penalty is far less severe than it looks at first glance. This is my go-to if I only have a few files to read.
But given a list of files (or the output of os.walk), I can also do the following:
for f in files:
    with open(f) as fh:
        lines = list(fh)
    for line in lines:
        ...  # do stuff
# Or a variation on this
If I have hundreds of files to read, I'd like to load them all into memory before scanning them. The logic here is to keep file access time to a minimum (and let the OS do its filesystem magic) and to keep the scanning logic simple, since I/O is often the bottleneck. It will obviously cost a lot more memory, but will it improve performance?
Are my assumptions correct, and/or are there better ways of doing this? If there's no clear answer, what would be the best way to measure this in Python?
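For what it's worth, here is a rough sketch of how I was thinking of timing the two approaches with time.perf_counter (files and needle are placeholder names, not real variables from my code), in case that is the right way to measure it:

import time

def scan_streaming(files, needle):
    # Read each file line by line and count matching lines.
    hits = 0
    for path in files:
        with open(path) as fh:
            for line in fh:
                if needle in line:
                    hits += 1
    return hits

def scan_preloaded(files, needle):
    # Load every file fully into memory first, then scan the cached lines.
    cached = []
    for path in files:
        with open(path) as fh:
            cached.append(fh.readlines())
    hits = 0
    for lines in cached:
        for line in lines:
            if needle in line:
                hits += 1
    return hits

# files and needle must already be defined for this to run
for fn in (scan_streaming, scan_preloaded):
    start = time.perf_counter()
    fn(files, needle)
    print(fn.__name__, time.perf_counter() - start)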
Upvotes: 0
Views: 1126
Reputation: 1894
Is that premature optimization?
Did you actually profile the whole process? Is there really a need to speed it up? See: https://stackify.com/premature-optimization-evil/
If you really do need to speed it up, you should consider a threaded approach, since the work is I/O-bound.
One easy way is to use a ThreadPoolExecutor, see: https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor
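A minimal sketch of that approach, assuming a list of paths called files and a search string called needle (both placeholder names, not taken from the question):

from concurrent.futures import ThreadPoolExecutor

def search_file(path, needle):
    # Return (path, line) pairs for the lines of one file that contain the string.
    matches = []
    with open(path) as fh:
        for line in fh:
            if needle in line:
                matches.append((path, line.rstrip()))
    return matches

def search_all(files, needle, workers=8):
    # Scan the files concurrently; threads help here because the work
    # is dominated by waiting on I/O, not by the Python-level scanning.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_file = pool.map(lambda path: search_file(path, needle), files)
    return [hit for matches in per_file for hit in matches]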
Another way (if you are on Linux) is to simply execute shell commands like 'find', 'grep', etc. - those little C programs are highly optimized and will surely be the fastest solution. You can use Python to wrap those commands.
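For example, a rough sketch that wraps grep via subprocess (assumes a Unix-like system with grep on the PATH; needle and files are placeholder names):

import subprocess

def grep_files(needle, files):
    # -H prints the file name, -n the line number; grep exits with 1
    # when nothing matches, which is not an error for our purposes.
    proc = subprocess.run(
        ["grep", "-Hn", "--", needle, *files],
        capture_output=True,
        text=True,
    )
    return proc.stdout.splitlines()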
A regexp is not faster, contrary to what @Abdul Rahman Ali stated:
$ python -m timeit '"aaaa" in "bbbaaaaaabbb"'
10000000 loops, best of 3: 0.0767 usec per loop
$ python -m timeit -s 'import re; pattern = re.compile("aaaa")' 'pattern.search("bbbaaaaaabbb")'
1000000 loops, best of 3: 0.356 usec per loop
Upvotes: 1
Reputation: 1
The best way to search for a pattern in text is to use regular expressions:
import re

list_of_wanted_words = []
with open('folder.txt') as f:
    for line in f:
        wanted_words = re.findall('(^[a-z]+)', line)  # find the text in a line and extract it
        for word in wanted_words:  # put each matched word in the list
            list_of_wanted_words.append(word)
print(list_of_wanted_words)
Upvotes: 0