Rookie

Reputation: 429

Get the whole word if a few of its characters are present in text file content

I have a text file that contains more than 100 paragraphs. I want to find and list the words that contain a specific string.

This is my text file content:

A computer is a general purpose device that can be programmed to carry out a set of arithmetic or logical operations automatically. Since a sequence of operations can be readily changed, the computer can solve more than one kind of problem.

I want to retrieve the words that contain ra. For the text above, it should return general, programmed and operations.

Here is my code:

with open('computer.txt', 'r') as searchfile:
    for line in searchfile:
        if "ra" in line:
            line_split = line.split(' ')
            for each in line_split:
                if "ra" in each:
                    print each

What would be the most efficient way to do this?

Upvotes: 2

Views: 881

Answers (2)

Padraic Cunningham

Reputation: 180401

Your code can be reduced to:

with open('computer.txt', 'r') as f:
    print [word for word in f.read().split() if "ra" in word]

Output:

['general', 'programmed', 'operations', 'operations']

Timings on a file with 100 paragraphs:

In [7]: %%timeit
with open('computer.txt', 'r') as f:
    r = re.compile(r"\b\w*ra\w*\b")
    r.findall(f.read())
   ...: 
100 loops, best of 3: 2.82 ms per loop

In [8]: %%timeit
with open('computer.txt', 'r') as f:
    [word for word in f.read().split() if "ra" in word]
   ...: 
1000 loops, best of 3: 1.35 ms per loop

Or use translate with string.punctuation to strip punctuation, so that operations and operations. (with a trailing period) come out as the same word:

In [18]: %%timeit
with open('out.txt', 'r') as f:
    lines = [word.translate(None,  string.punctuation) for word in f.read().split() if "ra" in word]
   ....: 
100 loops, best of 3: 2.13 ms per loop

In [19]: %%timeit
with open('out.txt', 'r') as f:
    r = re.compile(r"\b\w*ra\w*\b")
    r.findall(f.read())
   ....: 
100 loops, best of 3: 3.53 ms per loop
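
The translate(None, string.punctuation) call above uses the Python 2 signature. As a rough sketch of the same idea on Python 3 (an assumption, since this answer targets Python 2), a deletion table built with str.maketrans can be passed instead. The helper name words_containing and the sample string are illustrative only, not part of the answer.

```python
import string

# Translation table that deletes every ASCII punctuation character.
STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def words_containing(text, needle):
    """Return whitespace-separated words that contain needle, punctuation removed."""
    return [word.translate(STRIP_PUNCT) for word in text.split() if needle in word]


# Illustrative sample: the second "operations." carries a trailing period.
sample = "logical operations automatically. Since a sequence of operations. can change"
print(words_containing(sample, "ra"))
```

As in the timed version, the membership test runs before the punctuation is removed, so only the words that are kept get translated.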

Upvotes: 1

Tim Pietzcker

Reputation: 336138

A regular expression would work nicely here:

>>> import re
>>> r = re.compile(r"\b\w*ra\w*\b")
>>> r.findall("A computer is a general purpose device that can be programmed to carry out a set of arithmetic or logical operations automatically. Since a sequence of operations can be readily changed, the computer can solve more than one kind of problem.")
['general', 'programmed', 'operations', 'operations']

This list contains duplicates, which can be removed with a simple set() call. A set does not preserve the order of the elements, though, so if you need to keep that order, a bit more work is necessary.
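
One possible shape for that extra work, as a minimal sketch: walk the matches once and remember which words have already been seen. The function name unique_in_order is made up for illustration.

```python
def unique_in_order(words):
    """Drop repeated words while keeping the first occurrence of each, in order."""
    seen = set()
    result = []
    for word in words:
        if word not in seen:
            seen.add(word)
            result.append(word)
    return result


matches = ['general', 'programmed', 'operations', 'operations']
print(unique_in_order(matches))
```

The set gives constant-time membership checks on average, so the whole pass stays linear in the number of matches.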

Note that the regex is rather naive in what it considers a "word":

\b   # Start of an alphanumeric word
\w*  # Match any number of word characters [A-Za-z0-9_]
ra   # Match ra
\w*  # Match any number of word characters
\b   # End of a word
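
To make "naive" concrete, here is a small sketch of how the pattern treats a few edge cases. The sample string is invented for illustration, and the re.IGNORECASE variant is an optional addition, not something the question asked for.

```python
import re

pattern = re.compile(r"\b\w*ra\w*\b")

# Hyphens end a word, underscores and digits do not, and matching is case-sensitive.
sample = "extra-large ultra_violet track42 Rabbit"
strict = pattern.findall(sample)
print(strict)

# Same pattern, but ignoring case so that "Ra" also counts as "ra".
relaxed = re.compile(r"\b\w*ra\w*\b", re.IGNORECASE).findall(sample)
print(relaxed)
```

Here "extra-large" yields only "extra" because the hyphen is not a word character, while "Rabbit" is found only by the case-insensitive variant.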

Upvotes: 2
