Amin.A
Amin.A

Reputation: 361

Regex: Matching two regular expressions

Assume we have collection of strings as input:

str1: ABCD
str2: ABCD
str3: AWXYD
str4: AWXYD

The goal is to remove duplicates and preserve the unique ones. Having the above input, our output should look like the following:

ABCD
AWXYD

Unfortunately, the machine that produces this collection is error prone and sometimes misses some alphabets (see below). Luckily, we are able detect that there is a missing part. But we don’t know how large this missing part is. In reality we have the following input:

str1: A?CD
str2: AB?D
str3: AWXYD
str4: A?D

where ? indicates the missing part.

In this example, we would like to preserve A?CD or AB?D and also AWXYD.

In an attempt to solve this, I substituted ? with .* and assumed these strings to be regular expressions:

Reg1 --> A.*CD
Reg2 --> AB.*D
Reg3 --> AWXYD
Reg4 --> A.*D

Now I am trying to identify the duplicates by comparing regular expressions. Using this approach, one can easily match Reg4 to Reg3 because Reg3 is actually a string (no missing part). Things become complicated when both have missing parts and therefore you have to compare regular expressions.

I wonder if it is possible or if there is a better solution for this.

Thanks!

Edit 1: Note that str1 and str2 might came from different strings (e.g. AXCD and ABXD). Our goal is to remove any (possible) duplicates and be sure that preserved strings are unique (even if we remove more). This is why we preserve either str1 or str2. Thanks to Aaron who pointed this out.

Edit 2: There are millions of strings. That's why an algorithms is needed.

Upvotes: 3

Views: 1135

Answers (2)

callOfCode
callOfCode

Reputation: 1026

I don't think regular expressions is appropriate for such task. If you are asking if there is an implemented way to compare regular expressions, answer is NO. At least I haven't seen it. If you are asking if there is a way to implement it, I would say YES. You can represent regex as finite state machine as well as graph. And it is possible to check isomorphism of these things. But it would be enormously complex to do it for regular expressions. Three things that pops in my mind now is: Levenshtein distance algorithm , Binary search tree (extremely search efficient data structure) and Black Board architecture . And also here you will find several answers that can help you. Good luck!

P.S PostgreSQL has fuzzystrmatch module with Levenshtein algorithm implementation.

Upvotes: 2

fronthem
fronthem

Reputation: 4139

I think the problem is your pattern.

Reg1 --> A.*CD

and

Reg2 --> AB.*D

Sometimes, they are represented the same pattern e.g.

ABCD

Either Reg1 or Reg2 can be matched this text. That means there is some duplicated pattern inside Reg1 and Reg2.

You might solve your problem by changing your pattern to

Reg1 --> A(?!B).*CD
// (?!B) means the second character can be any letters except `B`

and

Reg2 --> A.*(?<!C)D
// (?<!C) means the second last character can be any letters except `C`

Otherwise you can not distinguish between these two patterns.

Upvotes: 0

Related Questions