Abhishek J

Reputation: 2584

Regex subsequence matching

I'm using Python, but code in any language will do for this question.

Suppose I have 2 strings.

sequence = 'abcd'
string = 'axyzbdclkd'

In the above example, sequence is a subsequence of string.

How can I check if sequence is a subsequence of string using regex? Also check the examples here for the difference between a subsequence and a subarray, and for what I mean by subsequence.

The only thing I could think of is this, but it's far from what I want.

import re
c = re.compile('abcd')
c.match('axyzbdclkd')

Upvotes: 7

Views: 3868

Answers (3)

inf3rno

Reputation: 26139

I don't think the solution is as simple as @schwobaseggl claims. Let me show you another sequence from your database: ab1b2cd. By using the abcd subsequence for pattern matching you can get 2 results: ab(1b2)cd and a(b1)b(2)cd. So for testing purposes the proposed ^.*a.*b.*c.*d.*$ is ok(ish), but for parsing, ^a(.*)b(.*)cd$ will always be greedy and give you the second result. To get the first result you'll need to make the first group lazy: ^a(.*?)b(.*)cd$.

So if you need this for parsing, you should know how many variables are expected. To optimize the regex pattern, parse a few example strings and put capturing groups for the gaps only at the positions where you really need them. A more advanced version would use the pattern of the actual variable instead of .*, for example ^ab(\d\w\d)cd$ for the first result or ^a(\w\d)b(\d)cd$ for the second.
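The greedy/lazy difference is easy to check by running both patterns over ab1b2cd with Python's re module. This is just a minimal sketch; the helper name gaps is made up for illustration and is not part of any library.

```python
import re

def gaps(pattern, text):
    """Return the captured gap groups, or None when the pattern does not match."""
    m = re.match(pattern, text)
    return m.groups() if m else None

text = 'ab1b2cd'
greedy = gaps(r'^a(.*)b(.*)cd$', text)   # first gap grabs as much as it can
lazy = gaps(r'^a(.*?)b(.*)cd$', text)    # first gap grabs as little as it can

print(greedy)  # ('b1', '2')
print(lazy)    # ('', '1b2')
```

Both patterns accept the string, so for a plain yes/no test the choice does not matter; it only changes which characters end up in which group.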

Upvotes: 0

willeM_ Van Onsem

Reputation: 476594

For an arbitrary sequence, you can construct a regex like:

import re

sequence = 'abcd'
rgx = re.compile('.*'.join(re.escape(x) for x in sequence))

which will, for 'abcd', result in the regex 'a.*b.*c.*d'. You can then use rgx.search(..):

the_string = 'axyzbdclkd'
if rgx.search(the_string):
    # ... the sequence is a subsequence.
    pass

By using re.escape(..) you know for sure that, for instance, a '.' in the original sequence will be translated to '\.' and thus match only a literal dot instead of any character.
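If you need this in more than one place, the construction can be wrapped in a small function. This is only a sketch: the name is_subsequence is made up here, and passing re.DOTALL is my own assumption so that the gaps can also span newlines.

```python
import re

def is_subsequence(sequence, text):
    # Escape every character so regex metacharacters are matched literally,
    # then allow any run of characters between consecutive ones.
    pattern = '.*'.join(re.escape(ch) for ch in sequence)
    return re.search(pattern, text, flags=re.DOTALL) is not None

print(is_subsequence('abcd', 'axyzbdclkd'))  # True
print(is_subsequence('a.c', 'abc'))          # False: no literal dot in 'abc'
print(is_subsequence('a.c', 'a-.-c'))        # True
```

The second call is the case re.escape(..) protects against: without escaping, the '.' would act as a wildcard and 'abc' would be accepted.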

Upvotes: 3

user2390182

Reputation: 73460

Just allow arbitrary strings in between:

import re
c = re.compile('.*a.*b.*c.*d.*')
# .* matches any character (except a newline), zero or more times
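As a quick check, here is that pattern applied to the string from the question. It is a small sketch using only the standard re module; the extra test strings are made up for illustration.

```python
import re

c = re.compile('.*a.*b.*c.*d.*')

print(bool(c.match('axyzbdclkd')))  # True: a, b, c, d appear in order
print(bool(c.match('dcba')))        # False: the order is wrong

# match() anchors at the start, which is why the leading .* is needed.
# With search() the leading and trailing .* can be dropped:
print(bool(re.search('a.*b.*c.*d', 'xxaxyzbdclkd')))  # True
```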

Upvotes: 9
