Nithin
Nithin

Reputation: 73

Mysql Regular Expression search with no repeating characters

I have a database table with words from a dictionary.

Now I want to select words for an anagram. For example if I give the string SEPIAN it should fetch values like apes, pain, pains, pies, pines, sepia, etc.

For this I used the query

SELECT * FROM words WHERE word REGEXP '^[SEPIAN]{1,6}$'

But this query returns words like anna, essen which have repeated characters not in the supplied string. Eg. anna has two n's but there is only one n in the search string SEPIAN.

How can I write my regular expression to achieve this? Also if there are repeated characters in my search string at that time the repeated characters should reflect in the result.

Upvotes: 7

Views: 727

Answers (2)

Timur
Timur

Reputation: 6718

I guess something like this will help you. Table words:

| id    | word      | alfagram  |
---------------------------------
| 1     | karabar   | aaabkrr   |
| 2     | malabar   | aaablmr   |
| 3     | trantantan| aaannnrttt|

alfagram here is letters of a word in an alphabetical order.

PHP code:

$searchString = 'abrakadabra';
$searchStringAlfa = array();
for( $i=0,$c=strlen($searchString);$i<$c;$i++ ){
    if( isset($searchStringAlfa[$searchString[$i]]) ){
        $searchStringAlfa[$searchString[$i]]++;
    }else{
        $searchStringAlfa[$searchString[$i]] = 1;
    }
}
ksort($searchStringAlfa);
$regexp = '^';
foreach( $searchStringAlfa as $alfa=>$amount ){
    $regexp .= '['.$alfa.']{0,'.$amount.'}';
}
$regexp .= '$';

$searchString is the string you wish to search with. Then the only thing you should do is execute the query:

$result = mysql_query('SELECT * FROM words WHERE alfagram REGEXP "'.$regexp.'"');

May be some additional checks and optimisations are needed

Upvotes: 2

dlras2
dlras2

Reputation: 8486

Since MySQL does not support back-referencing capturing groups, the typical solution of (\w).*\1 will not work. This means that any solution given will need to enumerate all possible doubles. Furthermore, as far as I can tell back-references are not valid in look-aheads or look-behinds, and look-aheads and look-behinds are not supported in MySQL.

However, you can split this into two expressions, and use the following query:

SELECT * FROM words
WHERE word REGEXP '^[SEPIAN]{1,6}$'
AND NOT word REGEXP 'S.*?S|E.*?E|P.*?P|I.*?I|A.*?A|N.*?N'

Not very pretty, but it works and it should be fairly efficient as well.


To support a set limit of repeated characters, use the following pattern for your secondary expression:

A(.*?A){X,}

Where A is your character and X is the number of times it's allowed.

So if you're adding another N to your string SEPIANN (for a total of 2 Ns), your query would become:

SELECT * FROM words
WHERE word REGEXP '^[SEPIAN]{1,7}$'
AND NOT word REGEXP 'S.*?S|E.*?E|P.*?P|I.*?I|A.*?A|N(.*?N){2}'

Upvotes: 5

Related Questions