James Chang

Reputation: 630

How to make an efficient string filter in Python?

I have two list objects: wiki_text and corpus. wiki_text is made up of small phrases and corpus is made up of long sentences.

wiki_text = ['never ending song of love - ns.jpg',
 'ecclesiological society',
 "1955-56 michigan wolverines men's basketball team",
 'sphinx strix',
 'petlas',
 '1966 mlb draft',
 ...]

corpus = ['Substantial progress has been made in the last twenty years',
          'Patients are at risk for prostate cancer.',...]

My goal is to create a filter that keeps the elements of wiki_text which appear as a substring of some element in corpus. For example, if 'ecclesiological society' exists as part of a sentence in corpus, it should be kept in the final result. The final result should be a subset of the original wiki_text. The following code is what I used before:

def wiki_filter(wiki_text, corpus):
    result = []
    for i in wiki_text:
        for e in corpus:
            if i in e:
                result.append(i)
                break
    return result

However, given the lengths of wiki_text and corpus (each > 10 million elements), this nested loop performs up to len(wiki_text) × len(corpus) substring checks and takes extremely long to run. Is there a better way to solve this problem?

Upvotes: 0

Views: 483

Answers (3)

Onyambu

Reputation: 79238

How fast can the regex engine work here? You can try:

import re
# escape each phrase so regex metacharacters in wiki_text can't break the joined pattern
matches = re.findall('|'.join(map(re.escape, wiki_text)), '\n'.join(corpus))
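If you need the subset of wiki_text itself rather than the raw match list, deduplicating the matches recovers it. A small sketch (the found and filtered_wiki_text names are just for illustration; bear in mind that with more than 10 million alternatives the joined pattern becomes enormous, so this may not scale to the sizes in the question):

# set(matches) holds every phrase that occurred at least once in corpus
found = set(matches)
filtered_wiki_text = [w for w in wiki_text if w in found]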

Upvotes: 0

Teodor Marinescu

Reputation: 186

To make it really fast, I would suggest a bit of an unorthodox approach, namely using Lucene (PyLucene if you are forced to use only Python).

Apache Lucene™ is a high-performance, full-featured text search engine library written entirely in Java. PyLucene is a Python extension for accessing Java Lucene™. Its goal is to allow you to use Lucene's text indexing and searching capabilities from Python.

This is how I would do it: index the corpus sentences, each sentence in a separate document. Then, using Lucene's search functionality, search for each of the phrases in your wiki_text with a string query, as in the sketch below.

Now, this approach is not the easiest or most straightforward one to use, but it would be one of the fastest in my opinion. You would probably be able to perform millions of searches (wiki_text phrases) over millions of records (corpus) in a matter of minutes. So if @coldspeed's flashtext solution satisfies your needs, go for it; otherwise, give Lucene a try!
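For concreteness, here is a rough sketch of that approach with PyLucene, assuming the Lucene 8.x class layout (module paths and the ByteBuffersDirectory class differ across versions, so treat this as an outline). Note that Lucene's tokenized phrase matching is close to, but not byte-for-byte identical with, Python's substring test:

import lucene
from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.document import Document, Field, TextField
from org.apache.lucene.index import DirectoryReader, IndexWriter, IndexWriterConfig
from org.apache.lucene.queryparser.classic import QueryParser
from org.apache.lucene.search import IndexSearcher
from org.apache.lucene.store import ByteBuffersDirectory

lucene.initVM()

# index each corpus sentence as its own document
directory = ByteBuffersDirectory()
analyzer = StandardAnalyzer()
writer = IndexWriter(directory, IndexWriterConfig(analyzer))
for sentence in corpus:
    doc = Document()
    doc.add(TextField("body", sentence, Field.Store.NO))
    writer.addDocument(doc)
writer.close()

# run each wiki_text phrase as a quoted phrase query;
# a single hit is enough to keep the phrase
searcher = IndexSearcher(DirectoryReader.open(directory))
parser = QueryParser("body", analyzer)
result = []
for phrase in wiki_text:
    query = parser.parse('"%s"' % QueryParser.escape(phrase))
    if searcher.search(query, 1).totalHits.value > 0:
        result.append(phrase)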

Upvotes: 0

cs95

Reputation: 402593

Let's see if flashtext can help here.

First, pip install flashtext, and then build a KeywordProcessor object and call extract_keywords to filter out your strings.

from flashtext import KeywordProcessor

# register every wiki_text phrase as a keyword
keyword_processor = KeywordProcessor()
for w in wiki_text:
    keyword_processor.add_keyword(w)

# keep only the sentences that contain at least one registered phrase
filtered_corpus = [c for c in corpus if keyword_processor.extract_keywords(c)]

Unfortunately, the flashtext API doesn't yet have a "has_keyword" method, so you will need to test the truthiness of the temporary list that extract_keywords returns and then discard it. If you're up to it, you can contribute to the project on GitHub.
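If you need the subset of wiki_text itself, as the question asks, rather than the matching sentences, one option is to collect every keyword that extract_keywords reports across the corpus. A sketch (keep in mind that flashtext matches whole words, case-insensitively by default, so results can differ slightly from a raw substring test):

# collect every registered phrase that flashtext finds in any sentence
found = {kw for c in corpus for kw in keyword_processor.extract_keywords(c)}
filtered_wiki_text = [w for w in wiki_text if w in found]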

Upvotes: 2
