Reputation: 630
I have two list objects: wiki_text and corpus. wiki_text is made up of short phrases and corpus is made up of long sentences.
wiki_text = ['never ending song of love - ns.jpg',
'ecclesiological society',
"1955-56 michigan wolverines men's basketball team",
'sphinx strix',
'petlas',
'1966 mlb draft',
...]
corpus = ['Substantial progress has been made in the last twenty years',
'Patients are at risk for prostate cancer.',...]
My goal is to create a filter which keeps the elements of wiki_text that appear as a substring of some element in corpus. For example, if 'ecclesiological society' appears as part of a sentence in corpus, it should be kept in the final result. The final result should be a subset of the original wiki_text. The following code is what I used before:
def wiki_filter(wiki_text, corpus):
    result = []
    for i in wiki_text:
        for e in corpus:
            if i in e:
                result.append(i)
                break
    return result
However, given the length of wiki_text and corpus (each has more than 10 million elements), this function takes extremely long to run. Is there a better way to solve this problem?
Upvotes: 0
Views: 483
Reputation: 79238
How fast can the regex engine work here? You can try
import re
re.findall('|'.join(wiki_text),'\n'.join(corpus))
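Note that some of the phrases contain characters that regex treats specially (dots, parentheses, and so on), so a hedged variant of the same idea would escape each phrase first and then keep the phrases that were actually found. A sketch, not a tested drop-in:

import re

# Escape every phrase so it is matched literally, then build one big alternation.
pattern = re.compile('|'.join(re.escape(w) for w in wiki_text))

# findall returns the matched phrases themselves, so the unique hits are
# exactly the wiki_text entries that occur somewhere in corpus.
found = set(pattern.findall('\n'.join(corpus)))
filtered_wiki_text = [w for w in wiki_text if w in found]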
Upvotes: 0
Reputation: 186
To make it really fast I would suggest a bit of an unorthodox approach, namely using Lucene (PyLucene if you are forced to use only Python).
Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. PyLucene is a Python extension for accessing Java Lucene. Its goal is to allow you to use Lucene's text indexing and searching capabilities from Python.
This is how I would do it: index the corpus sentences, with each sentence in a separate record. Then, using Lucene's search functionality, search for each of the phrases in your wiki_text as a string query.
Now, this approach is not the easiest or most straightforward one to use, but it would be one of the fastest in my opinion. You would probably be able to perform millions of searches (wiki_text phrases) over millions of records (corpus) in a matter of minutes. So if @coldspeed's flashtext solution satisfies your needs, go for it; otherwise, give Lucene a try!
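A rough outline of that index-then-search flow in PyLucene might look like the following. Module paths and exact API details vary by Lucene version, a phrase query goes through the analyzer rather than doing raw substring matching, and for a corpus this size you would index on disk rather than in memory, so treat this strictly as a sketch:

import lucene
from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.document import Document, Field, TextField
from org.apache.lucene.index import DirectoryReader, IndexWriter, IndexWriterConfig
from org.apache.lucene.queryparser.classic import QueryParser
from org.apache.lucene.search import IndexSearcher
from org.apache.lucene.store import RAMDirectory

lucene.initVM()

# Index: one corpus sentence per Lucene document.
directory = RAMDirectory()
analyzer = StandardAnalyzer()
writer = IndexWriter(directory, IndexWriterConfig(analyzer))
for sentence in corpus:
    doc = Document()
    doc.add(TextField("text", sentence, Field.Store.NO))
    writer.addDocument(doc)
writer.close()

# Search: keep every wiki_text phrase that matches at least one sentence.
searcher = IndexSearcher(DirectoryReader.open(directory))
parser = QueryParser("text", analyzer)
result = []
for phrase in wiki_text:
    query = parser.parse('"%s"' % QueryParser.escape(phrase))  # quoted -> phrase query
    if len(searcher.search(query, 1).scoreDocs) > 0:
        result.append(phrase)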
Upvotes: 0
Reputation: 402593
Let's see if flashtext can help here.
First, pip install flashtext, and then build a KeywordProcessor object and call extract_keywords to filter your strings.
from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor()
for w in wiki_text:
keyword_processor.add_keyword(w)
filtered_corpus = [c for c in corpus if keyword_processor.extract_keywords(c)]
Unfortunately, the flashtext API doesn't yet have a "has_keyword" method, so you will need to test the truthiness of the temporary list that extract_keywords returns and then discard it. If you're up to it, you can contribute to the project on GitHub.
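If what you need is the subset of wiki_text itself rather than the matching sentences, one way to get it, reusing the same KeywordProcessor, is to collect every keyword that extract_keywords reports and filter wiki_text against that set (a small sketch building on the code above):

# Collect every phrase that was found in at least one corpus sentence.
found = set()
for c in corpus:
    found.update(keyword_processor.extract_keywords(c))

# The final result: the wiki_text entries that occur somewhere in corpus.
filtered_wiki_text = [w for w in wiki_text if w in found]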
Upvotes: 2