Reputation: 3542
How can I ask if a string pattern, in this case 'C', exists within any element of this set without removing each element and looking at it?
This test fails, and I am not sure why. My guess is that Python is checking whether any element in the set is 'C', rather than whether any element contains 'C':
In [1]: seto = set()
In [2]: seto.add('C123.45.32')
In [3]: seto.add('C2345.345.32')
In [4]: 'C' in seto
Out[4]: False
I know that I can iterate over the set to make this check:
In [11]: for x in seto:
   ....:     if 'C' in x:
   ....:         print(x)
   ....:
C2345.345.32
C123.45.32
But that is not what I am looking to do in this case. Ok thanks for the help!
Edit
I am sorry, these are set operations, not list as my original post implied.
Upvotes: 1
Views: 680
Reputation: 184280
The other solutions you've been given are correct, understandable, and good Python, and they are reasonably performant if your set is small.
It is, however, possible to do what you want (at, of course, a considerable overhead in memory and setup time; TANSTAAFL) much more quickly using an index. And this index maintains constant performance no matter how big your data gets (assuming you have enough memory to hold it all). If you're doing a lot of looking up, this can make your script a lot faster. And the memory isn't as bad as it could be...
We'll build a dict in which the keys are every possible substring from the items in the index, and the values are a set of the items that contain that substring.
from collections import defaultdict

class substring_index(defaultdict):

    def __init__(self, seq=()):
        defaultdict.__init__(self, set)
        for item in seq:
            self.add(item)

    def add(self, item):
        assert isinstance(item, str)   # requires strings
        if item not in self[item]:     # performance optimization for duplicates
            size = len(item) + 1
            for chunk in range(1, size):
                for start in range(0, size-chunk):
                    self[item[start:start+chunk]].add(item)

seto = substring_index()
seto.add('C123.45.32')
seto.add('C2345.345.32')

print(len(seto))    # 97 entries for 2 items, I wasn't kidding about the memory
Now you can easily (and instantly) test to see whether any substring is in the index:
print('C' in seto) # True
Or you can easily find all strings that contain a particular substring:
print(seto['C'])    # {'C2345.345.32', 'C123.45.32'} on Python 3 (set order may vary)
This can be pretty easily extended to include "starts with" and "ends with" matches, too, or to be case-insensitive.
For a less memory-intensive version of the same idea, look into tries.
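As a rough illustration of the "starts with" extension mentioned above, here is a minimal sketch that indexes only prefixes instead of every substring. The class name `PrefixIndex`, its `starting_with` method, the `case_sensitive` flag, and the sample item 'D9.1' are all made up for this example and are not part of the code above.

```python
from collections import defaultdict


class PrefixIndex:
    """Hypothetical helper: maps each prefix of an item to the items that start with it."""

    def __init__(self, items=(), case_sensitive=True):
        self.case_sensitive = case_sensitive
        self._index = defaultdict(set)
        for item in items:
            self.add(item)

    def _key(self, text):
        # Normalise case only when case-insensitive matching was requested.
        return text if self.case_sensitive else text.lower()

    def add(self, item):
        key = self._key(item)
        # One entry per prefix: len(item) entries instead of every substring.
        for end in range(1, len(key) + 1):
            self._index[key[:end]].add(item)

    def starting_with(self, prefix):
        # .get avoids creating empty entries for prefixes that were never indexed;
        # a copy is returned so callers cannot mutate the index.
        return set(self._index.get(self._key(prefix), ()))


index = PrefixIndex(['C123.45.32', 'C2345.345.32', 'D9.1'], case_sensitive=False)
matches = index.starting_with('c2')      # case-insensitive prefix lookup
no_matches = index.starting_with('45')   # '45' occurs mid-string only, so nothing matches
```

Because only prefixes are stored, this uses far fewer entries than the full substring index, at the cost of answering only "starts with" queries.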
Upvotes: 1
Reputation: 97631
Taking John's answer one stage further, if you want to use the subset of items containing 'C':
items_with_c = {item for item in seto if 'C' in item}
if items_with_c:
    do_something_with(items_with_c)
else:
    print("No items contain C")
Upvotes: 2
Reputation: 361869
'C' in seto
This checks whether any member of seto is exactly the string 'C': not a substring, but exactly that string. To check for a substring, you'll need to iterate over the set and perform a check on each item.
any('C' in item for item in seto)
The exact nature of the test can be easily changed. For instance, if you want to be stricter about where 'C' can appear:
any(item.startswith('C') for item in seto)
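To make the difference concrete, here is a small sketch that puts the membership test and the per-item tests side by side. The element 'X77.1' is made-up sample data, and `next(...)` with a default is just one possible way to get a matching element back rather than only a boolean.

```python
seto = {'C123.45.32', 'C2345.345.32', 'X77.1'}

# Membership: is the exact string 'C' an element of the set?
exact_member = 'C' in seto

# Substring: does at least one element contain 'C'?
any_contains = any('C' in item for item in seto)

# Stricter variant: does every element start with 'C'?
all_start = all(item.startswith('C') for item in seto)

# Fetch one matching element (or None) instead of a boolean.
first_match = next((item for item in seto if 'X' in item), None)

print(exact_member, any_contains, all_start, first_match)
```

Both `any` and `next` stop at the first match, so neither has to scan the whole set when a match turns up early.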
Upvotes: 3