Reputation: 429
I have a text file which contains more than 100 paragraphs. I want to find and list the words that contain a specific string.
This is my text file content:
A computer is a general purpose device that can be programmed to carry out a set of arithmetic or logical operations automatically. Since a sequence of operations can be readily changed, the computer can solve more than one kind of problem.
I want to retrieve the words that contain ra. It should return general, programmed and operations.
Here is my code:
with open('computer.txt', 'r') as searchfile:
    for line in searchfile:
        if "ra" in line:
            line_split = line.split(' ')
            for each in line_split:
                if "ra" in each:
                    print each
What would be the most efficient way to do this?
Upvotes: 2
Views: 881
Reputation: 180401
Your code can be reduced to:
with open('computer.txt', 'r') as f:
    print [word for word in f.read().split() if "ra" in word]
['general', 'programmed', 'operations', 'operations']
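If the file is too large to read comfortably in one go, the same filter can be applied line by line. This is only a sketch: the function name words_containing and its parameters are made up for illustration and are not part of the code above.

```python
def words_containing(path, needle):
    """Yield each whitespace-separated word in the file that contains `needle`."""
    with open(path, 'r') as f:
        for line in f:                  # the file object yields one line at a time
            for word in line.split():   # split on any run of whitespace
                if needle in word:
                    yield word

# Hypothetical usage, assuming computer.txt exists in the working directory:
#     matches = list(words_containing('computer.txt', 'ra'))
```

Like the list comprehension, this splits on whitespace only, so a word followed by punctuation keeps that punctuation.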
Timings on a file with 100 paragraphs:
In [7]: %%timeit
with open('computer.txt', 'r') as f:
    r = re.compile(r"\b\w*ra\w*\b")
    r.findall(f.read())
   ...:
100 loops, best of 3: 2.82 ms per loop
In [8]: %%timeit
with open('computer.txt', 'r') as f:
    [word for word in f.read().split() if "ra" in word]
   ...:
1000 loops, best of 3: 1.35 ms per loop
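Or, in Python 2, use str.translate with string.punctuation (this needs import string) to strip punctuation from each match, so that operations. with a trailing period is reported as operations, and so on: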
In [18]: %%timeit
with open('out.txt', 'r') as f:
    lines = [word.translate(None, string.punctuation) for word in f.read().split() if "ra" in word]
   ....:
100 loops, best of 3: 2.13 ms per loop
In [19]: %%timeit
with open('out.txt', 'r') as f:
    r = re.compile(r"\b\w*ra\w*\b")
    r.findall(f.read())
   ....:
100 loops, best of 3: 3.53 ms per loop
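The two-argument form word.translate(None, string.punctuation) only exists in Python 2. On Python 3, a rough equivalent builds a deletion table with str.maketrans first. The names STRIP_PUNCT and clean_words below are invented for this sketch, and the sample string is made up rather than taken from the file:

```python
import string

# Translation table that deletes every ASCII punctuation character (Python 3).
STRIP_PUNCT = str.maketrans('', '', string.punctuation)

def clean_words(text, needle):
    """Return the words containing `needle`, with ASCII punctuation removed."""
    return [word.translate(STRIP_PUNCT)
            for word in text.split()
            if needle in word]

sample = "logical operations. Since operations, generally, can change"
print(clean_words(sample, "ra"))
# ['operations', 'operations', 'generally']
```

Note that string.punctuation covers ASCII punctuation only, so typographic quotes or dashes would be left in place.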
Upvotes: 1
Reputation: 336138
A regular expression would work nicely here:
>>> import re
>>> r = re.compile(r"\b\w*ra\w*\b")
>>> r.findall("A computer is a general purpose device that can be programmed to carry out a set of arithmetic or logical operations automatically. Since a sequence of operations can be readily changed, the computer can solve more than one kind of problem.")
['general', 'programmed', 'operations', 'operations']
This list contains duplicates, which can be removed via a simple set() call (which in turn removes the order of the elements, so if you need to preserve that, a bit more work is necessary).
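One possible way to do that extra work is to keep a set of words already seen while building a list in order. This is a minimal sketch; the helper name unique_in_order is an assumption, not something from the standard library, and the sample text is shortened for illustration.

```python
import re

def unique_in_order(items):
    """Drop duplicates while keeping the first occurrence of each item."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result

text = ("a general purpose device that can be programmed to carry out "
        "operations; a sequence of operations can be changed")
matches = re.findall(r"\b\w*ra\w*\b", text)
print(unique_in_order(matches))
# ['general', 'programmed', 'operations']
```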
Note that the regex is rather naive in what it considers a "word":
\b   # Start of an alphanumeric word
\w*  # Match any number of word characters [A-Za-z0-9_]
ra   # Match ra
\w*  # Match any number of word characters
\b   # End of a word
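The commented pattern can be compiled as written with the re.VERBOSE flag, which ignores whitespace and # comments inside the pattern. The sketch below also shows what "naive" means in practice: digits and underscores count as word characters, while a hyphen splits a word in two. The name WORD_WITH_RA and the test string are made up for illustration.

```python
import re

WORD_WITH_RA = re.compile(r"""
    \b      # word boundary: start of a word
    \w*     # any number of word characters
    ra      # the literal substring "ra"
    \w*     # any number of word characters
    \b      # word boundary: end of a word
""", re.VERBOSE)

print(WORD_WITH_RA.findall("extra_1 co-operation 3rats"))
# ['extra_1', 'operation', '3rats']
```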
Upvotes: 2