Houdini

Reputation: 3542

How can I tell if a string pattern exists within any element of a set in Python?

How can I check whether a string pattern, in this case C, exists within any element of this set without pulling out each element and inspecting it?

This test fails, and I am not sure why. My guess is that Python is checking if any element in the set is C, instead of if any element contains C:

In [1]: seto = set()

In [2]: seto.add('C123.45.32')

In [3]: seto.add('C2345.345.32')

In [4]: 'C' in seto
Out[4]: False

I know that I can iterate over the set to make this check:

In [11]: for x in seto:
   ....:     if 'C' in x:
   ....:         print(x)
   ....:
C2345.345.32
C123.45.32

But that is not what I am looking to do in this case. Ok thanks for the help!

Edit

I am sorry, these are set operations, not list operations as my original post implied.

Upvotes: 1

Views: 680

Answers (3)

kindall

Reputation: 184280

The other solutions you've been given are correct, understandable, and good Python, and they are reasonably performant if your set is small.

It is, however, possible to do what you want much more quickly using an index, at a considerable cost in memory and setup time (TANSTAAFL). Each lookup is a single dict access, so lookup time stays roughly constant no matter how many strings you index (assuming you have enough memory to hold them all). If you're doing a lot of lookups, this can make your script a lot faster. And the memory isn't as bad as it could be...

We'll build a dict in which the keys are every possible substring from the items in the index, and the values are a set of the items that contain that substring.

from collections import defaultdict

class substring_index(defaultdict):

    def __init__(self, seq=()):
        defaultdict.__init__(self, set)
        for item in seq:
            self.add(item)

    def add(self, item):
        assert isinstance(item, str)   # requires strings
        if item not in self[item]:     # performance optimization for duplicates
            size = len(item) + 1
            for chunk in range(1, size):
                for start in range(0, size-chunk):
                    self[item[start:start+chunk]].add(item)

seto = substring_index()
seto.add('C123.45.32')
seto.add('C2345.345.32')

print(len(seto))      # 97 entries for 2 items, I wasn't kidding about the memory

Now you can easily (and instantly) test to see whether any substring is in the index:

print('C' in seto)    # True

Or you can easily find all strings that contain a particular substring:

print(seto['C'])      # set(['C2345.345.32', 'C123.45.32'])
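One caveat, since the class builds on defaultdict: looking up a missing key with square brackets inserts an empty set for it, so a later `in` test for that key comes back True even though no item contains it. Here is a minimal sketch of the effect and of a workaround with dict.get, using a plain defaultdict in place of the index (the name `index` and the key 'Z' are just for illustration):

```python
from collections import defaultdict

index = defaultdict(set)
index['C'].add('C123.45.32')

before = 'Z' in index          # False: nothing is stored under 'Z'
safe = index.get('Z', set())   # .get() does not trigger the default factory
after_get = 'Z' in index       # still False

_ = index['Z']                 # [] lookup inserts an empty set under 'Z'
after_getitem = 'Z' in index   # now True, although no item contains 'Z'

print(before, after_get, after_getitem)
```

So if you mix membership tests with lookups, `.get(substring, set())` is probably the safer way to read from the index.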

This can be pretty easily extended to include "starts with" and "ends with" matches, too, or to be case-insensitive.
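As a rough sketch of what those extensions could look like (this is one possible approach, not the only one, and `build_index`, `build_prefix_index` and the extra sample item 'x9.C1' are made up for the example): for case-insensitive matching, lowercase the keys while still storing the original item, and for "starts with", index only the prefixes.

```python
from collections import defaultdict

def build_index(items, case_sensitive=True):
    """Map every substring to the set of original items containing it."""
    index = defaultdict(set)
    for item in items:
        # Keys are lowercased for case-insensitive use; values keep the original.
        key = item if case_sensitive else item.lower()
        for start in range(len(key)):
            for end in range(start + 1, len(key) + 1):
                index[key[start:end]].add(item)
    return index

def build_prefix_index(items):
    """Map every prefix to the set of items that start with it."""
    index = defaultdict(set)
    for item in items:
        for end in range(1, len(item) + 1):
            index[item[:end]].add(item)
    return index

items = ['C123.45.32', 'C2345.345.32', 'x9.C1']
ci = build_index(items, case_sensitive=False)
prefixes = build_prefix_index(items)

print('c1' in ci)                          # case-insensitive substring test
print(sorted(prefixes.get('C', set())))    # items that start with 'C'
```

With the case-insensitive index, remember to lowercase the query as well before looking it up. An "ends with" index would work the same way with `item[start:]` suffixes as keys.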

For a less memory-intensive version of the same idea, look into tries.
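To give an idea of the trie approach, here is a small suffix-trie sketch built from nested dicts. The class name `SuffixTrie` is hypothetical, and this version only answers "does any item contain this substring?"; it does not track which items matched. It relies on the fact that every substring is a prefix of some suffix. A plain suffix trie can still grow quadratically in the worst case, so treat it as a starting point rather than a tuned structure.

```python
class SuffixTrie:
    """Nested dicts; each node maps a character to a child node."""

    def __init__(self, items=()):
        self.root = {}
        for item in items:
            self.add(item)

    def add(self, item):
        # Insert every suffix; any substring is a prefix of some suffix.
        for start in range(len(item)):
            node = self.root
            for ch in item[start:]:
                node = node.setdefault(ch, {})

    def __contains__(self, pattern):
        # Walk the trie one character at a time; fail on the first miss.
        node = self.root
        for ch in pattern:
            if ch not in node:
                return False
            node = node[ch]
        return True

trie = SuffixTrie(['C123.45.32', 'C2345.345.32'])
print('C' in trie, '45.3' in trie, 'X' in trie)
```

Common prefixes among the suffixes share nodes, which is where the memory saving over the dict-of-all-substrings comes from.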

Upvotes: 1

Eric

Reputation: 97631

Taking John's answer one stage further, if you want to use the subset of items containing C:

items_with_c = {item for item in seto if 'C' in item}
if items_with_c:
    do_something_with(items_with_c)
else:
    print("No items contain C")

Upvotes: 2

John Kugelman

Reputation: 361869

'C' in seto

This checks to see if any of the members of seto is the exact string 'C'. Not a substring, but exactly that string. To check for a substring, you'll want to iterate over the set and perform a check on each item.

any('C' in item for item in seto)

The exact nature of the test can be easily changed. For instance, if you want to be stricter about where C can appear:

any(item.startswith('C') for item in seto)
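The same any() shape should also work for a real pattern via the re module, and next() with a default gives you one matching item instead of just True/False. A small sketch, where the regex and the extra item 'D77.1' are made up for illustration:

```python
import re

seto = {'C123.45.32', 'C2345.345.32', 'D77.1'}

# Hypothetical pattern: a 'C' followed by exactly three digits and a dot.
pattern = re.compile(r'C\d{3}\.')

# any() stops at the first item for which the search succeeds.
has_match = any(pattern.search(item) for item in seto)

# next() returns one item containing 'C', or None if there is none.
# Sets are unordered, so which item you get is arbitrary.
first_c = next((item for item in seto if 'C' in item), None)

print(has_match)
print(first_c)
```

Both any() and next() stop iterating as soon as they find a hit, so neither scans the whole set unless it has to.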

Upvotes: 3
