Yunfei Yang
Yunfei Yang

Reputation: 21

An efficient way to find elements of a list that contain substrings from another list

list1 = ["happy new year", "game over", "a happy story", "hold on"]
list2 = ["happy", "new", "hold"]

Assume I have two string lists, I want to use a new list to store the matched pairs of those two lists just like below:

list3=[["happy new year","happy"],["happy new year","new"],["a happy story","happy"],["hold on","hold"]]

which means I need to get all pairs of strings in one list with their substrings in another list.

Actually that is about some Chinese ancient scripts data. The first list contains names of people in 10th to 13th century, and the second list contains titles of all the poems at that period. Ancient Chinese people often record their social relations in the title of their works. For example, someone may write a poem titled "For my friend Wang Anshi". In this case, the people "Wang Anshi" in the first list should be matched with this title. Also their are cases like "For my friend Wang Anshi and Su Shi" which contains more than one people in the title. So basically that's a huge work involved 30,000 people and 160,000 poems.

Following is my code:

list3 = []

for i in list1:
        for j in list2:
            if str(i).count(str(j)) > 0:
                list3.append([i,j])

I use str(i) because python always takes my Chinese strings as float. And this code does work but too too too slow. I must figure out another way to do that. Thanks!

Upvotes: 2

Views: 1657

Answers (3)

KingOtto
KingOtto

Reputation: 1483

This can be done quite easily in an absolutely strightforward way.

Option A: Finding "all" possible combinations: To find all strings in one list that contain substrings from another list, loop over all your strings of your list1 (the strings to assess) and for each element check whether it contains a substring of list2:

list1 = ["happy new year", "game over", "a happy story", "hold on"]
list2 = ["happy", "new", "hold"]
[(string, substring) for string in list1 for substring in list2 if substring in string]
>>> [('happy new year', 'happy'), ('happy new year', 'new'), ('a happy story', 'happy'), ('hold on', 'hold')]

(I do think the title of your question is a bit misleading, though, as you are not only asking for elements of a list that contain a substring of another list, but as per your code example you are looking for 'all possible combinations'.)

Thus option B: Finding "any" combination: Much simpler and faster, if you really only need what the question says, you can improve performance by finding only the 'any' matches:

[string for string in list1 if ( substring in string for substring in list2)]

Option B will also allow you to improve performance. In case the lists are very long, you can run B first, create a subset (only strings that will actually produce a match with a substring), and then expand again to catch 'all' instead of any.

Upvotes: 0

tobias_k
tobias_k

Reputation: 82889

If the lists are longer, it might be worth building a sort of "index" of the sentences a given word appears in. Creating the index takes about as long as finding the first word from list2 in all the sentences in list1 (it has to loop over all the words in all the sentences), and once created, you can get the sentences containing a word much faster in O(1).

list1 = ["happy new year", "game over", "a happy story", "hold on"]    
list2 = ["happy", "new", "hold"]

import collections    
index = collections.defaultdict(list)

for sentence in list1:
    for word in sentence.split():
        index[word].append(sentence)

res = [[sentence, word] for word in list2 for sentence in index[word]]

Result:

[['happy new year', 'happy'],
 ['a happy story', 'happy'],
 ['happy new year', 'new'],
 ['hold on', 'hold']]

This uses str.split to split the words at spaces, but if the sentences are more complex, e.g. if they contain punctuation, you might use a regular expression with word boundaries \b instead, and possibly normalize the sentences (e.g. convert to lowercase or apply a stemmer, not sure if this is applicable to Chinese, though).

Upvotes: 0

Martijn Pieters
Martijn Pieters

Reputation: 1121246

Use a regular expression to do the searching, via the re module. A regular expression engine can work out matching elements in a search through text much better than a nested for loop can.

I'm going to use better variable names here to make it clearer where what list has to go; titles are the poem titles you are searching through, and names the things you are trying to match. matched are the (title, name) pairs you want to produce:

import re

titles = ["happy new year", "game over", "a happy story", "hold on"]
names = ["happy", "new", "hold"]

by_reverse_length = sorted(names, key=len, reverse=True)
pattern = "|".join(map(re.escape, by_reverse_length))
any_name = re.compile("({})".format(pattern))
matches = []

for title in titles:
    for match in any_name.finditer(title):
        matches.append((title, match.group()))

The above produces your required output:

>>> matches
[('happy new year', 'happy'), ('happy new year', 'new'), ('a happy story', 'happy'), ('hold on', 'hold')]

The names are sorted by length, in reverse, so that longer names are found before shorter with the same prefix; e.g. Hollander is found before Holland is found before Holl.

The pattern string is created from your names to form a ...|...|... alternatives pattern, any one of those patterns can match, but the regex engine will find those listed earlier in the sequence over those put later, hence the need to reverse sort by length. The (...) parentheses around the whole pattern of names tells the regular expression engine to capture that part of the text, in a group. The match.group() call in the loop can then extract the matched text.

The re.escape() function call is there to prevent 'meta characters' in the names, characters with special meaning such as ^, $, (, ), etc, from being interpreted as their special regular expression meanings.

The re.finditer() function (and method on compiled patterns) then finds non-overlapping matches in order from left to right, so it'll never match shorter substrings, and gives us the opportunity to extract the match object for each. This gives you more options if you want to know about starting positions of the matches and other metadata as well, should you want those. Otherwise, re.findall() could also be used here.

If you are going to use the above on text with Western alphabets and not on Chinese, then you probably also want to add word boundary markers, \b:

any_name = re.compile("\b({})\b".format(pattern))

otherwise substrings part of a larger word can be matched. Since Chinese has no word boundary characters (such as spaces and punctuation) you don't want to use \b in such texts.

Upvotes: 2

Related Questions