user1843553
user1843553

Reputation: 41

Python: Intersection of full string from list with partial string

Let's say I have a string and a list of strings:

a = 'ABCDEFG'

b = ['ABC', 'QRS', 'AHQ']

How can I pull out the string in list b that matches up perfectly with a section of the string a? So the would return would be something like ['ABC']

The most important issue is that I have tens of millions of strings, so that time efficiency is essential.

Upvotes: 4

Views: 2494

Answers (5)

mgilson
mgilson

Reputation: 309949

If you only want the first match in b:

next((s for s in b if s in a), None)

This has the advantage of short-circuiting as soon as it finds a match whereas the other list solutions will keep going. If no match is found, it will return None.

Upvotes: 4

beoliver
beoliver

Reputation: 5759

You might want to have a look at the following algorithm:

Boyer–Moore string search algorithm And wikipedia

But without knowing more, this might be overkill!

Upvotes: 0

abarnert
abarnert

Reputation: 365787

Keep in mind that Python's substring search x in a is already optimized pretty well for the general case (and coded in C, for CPython), so you're unlikely to beat it in general, especially with pure Python code.

However, if you have a more specialized case, you can do much better.

For example, if you have an arbitrary list of millions of strings b that all need to be searched for within one giant static string a that never changes, preprocessing a can make a huge difference. (Note that this is the opposite of the usual case, where preprocessing the patterns is the key.)

On the other hand, if you expect matches to be unlikely, and you've got the whole b list in advance, you can probably get some large gains by organizing b in some way. For example, there's no point searching for "ABCD" if "ABC" already failed; if you need to search both "ABC" and "ABD" you can search for "AB" first and then check whether it's followed by "C" or "D" so you don't have to repeat yourself; etc. (It might even be possible to merge all of b into a single regular expression that's close enough to optimal… although with millions of elements, that probably isn't the answer.)

But it's hard to guess in advance, with no more information than you've given us, exactly what algorithm you want.

Wikipedia has a pretty good high-level overview of string searching algorithms. There's also a website devoted to pattern matching in general, which seems to be a bit out of date, but then I doubt you're going to turn out to need an algorithm invented in the past 3 years anyway.

Upvotes: 1

NPE
NPE

Reputation: 500447

In [3]: [s for s in b if s in a]
Out[3]: ['ABC']

On my machine this takes about 3 seconds when b contains 20,000,000 elements (tested with a and b containing strings similar to those in the question).

Upvotes: 0

Nix
Nix

Reputation: 58532

Answer:

(x for x in b if x in a )

That will return a generator that will be a list of ones that match. Take the first or loop over it.

Upvotes: 0

Related Questions