Johnny

Reputation: 127

Python: fast search for the first valid substring of a text from a list of substrings

I need a fast and efficient way to find, from a list of many pattern strings, one that is a valid substring of a given string.

Conditions -

  1. I have a list of 100 pattern strings, added in a particular (known) order.
  2. The test case file is 35 GB and contains a long string on each line.

Ask -

I have to traverse the file and, for each line, find the first pattern string (in list order, out of the 100) that is a valid substring of the line.

Example -

pattern_strings = ["earth is round and huge","earth is round", "mars is small"]

Testcase file contents - Among all the planets, the earth is round and mars is small.

..

..

Hence, for the first line, the string at index 1 ("earth is round") satisfies the condition.

Currently, I am trying to do a linear search -

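def search(line, list_of_patterns):
    # Check the patterns in list order and return the first one found in the line.
    for pat in list_of_patterns:
        if pat in line:
            return pat
    # No pattern is a substring of this line.
    return -1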

The current run time is 21 minutes. The intent is to reduce it further. Need suggestions!

Upvotes: 0

Views: 170

Answers (1)

Attitude12136

Reputation: 47

One trick I know of, which requires no changes to your existing code, is to run it with PyPy rather than the standard CPython interpreter. That can significantly speed up execution.

https://www.pypy.org/features.html

Having installed and used it myself, I can tell you now that installation is fairly simple.

This is one option if you do not want to change your code.

Another suggestion would be to time your code or use profilers to see where the bottleneck is and what is taking a relatively long amount of time.
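As a rough sketch of the profiling suggestion, the standard-library cProfile and pstats modules can show where the time goes. The helper name profile_search and the in-memory sample_lines are assumptions made for illustration only; on the real 35 GB file the lines would come from reading the file, and the file I/O itself may turn out to dominate.

```python
import cProfile
import io
import pstats


def search(line, list_of_patterns):
    # Same linear search as in the question: first pattern (in list order) found in the line.
    for pat in list_of_patterns:
        if pat in line:
            return pat
    return -1


def profile_search(lines, list_of_patterns, top=5):
    """Hypothetical helper: run search() over the lines under cProfile.

    Returns the search results and the profiler report as text.
    """
    profiler = cProfile.Profile()
    profiler.enable()
    results = [search(line, list_of_patterns) for line in lines]
    profiler.disable()

    # Collect the report in a string instead of printing it directly.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return results, buffer.getvalue()


pattern_strings = ["earth is round and huge", "earth is round", "mars is small"]
# Stand-in data (assumption): the example line from the question, repeated.
sample_lines = ["Among all the planets, the earth is round and mars is small."] * 1000
results, report = profile_search(sample_lines, pattern_strings)
print(report)
```

If search() accounts for only a small share of the cumulative time on a real sample of the file, the bottleneck is more likely reading and decoding the lines than the substring checks.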

Code-wise, you could avoid the explicit for loop and try map, filter, and reduce instead: https://betterprogramming.pub/how-to-replace-your-python-for-loops-with-map-filter-and-reduce-c1b5fa96f43a
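For what it's worth, here is one way the loop could be written with filter, as a minimal sketch. It keeps the question's function name and its -1 return value for "no match". Whether it is actually faster than the explicit loop is not something I can promise, so it is worth timing both on a sample of the file.

```python
def search(line, list_of_patterns):
    # line.__contains__(pat) is what `pat in line` evaluates under the hood.
    # filter() is lazy, so next() stops at the first pattern that matches,
    # which preserves list order exactly like the explicit loop.
    return next(filter(line.__contains__, list_of_patterns), -1)


pattern_strings = ["earth is round and huge", "earth is round", "mars is small"]
line = "Among all the planets, the earth is round and mars is small."
print(search(line, pattern_strings))
```

The iteration over the patterns still happens, only inside filter rather than in a Python-level for statement, so any gain comes from reduced interpreter overhead rather than from doing less work.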

A final option would be to write that piece of code in a faster, more performant language such as C++ and call the resulting executable (an .exe on Windows) from Python.

Upvotes: 1
