piedog

Reputation: 83

When using regex in C, \d does not work but [0-9] does

I do not understand why a regex pattern containing the \d character class does not work when [0-9] does. Other shorthand classes, such as \s (whitespace characters) and \w (word characters), do work. My compiler is gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3. I am using the C regular expression library (regex.h).

Why doesn't \d work?

Text string:

const char *text = "148  apples    5 oranges";

For the above text string, this regex does not match:

const char *rstr = "^\\d+\\s+\\w+\\s+\\d+\\s+\\w+$";

This regex matches when using [0-9] instead of \d:

const char *rstr = "^[0-9]+\\s+\\w+\\s+[0-9]+\\s+\\w+$";



#include <stdio.h>
#include <stdlib.h>
#include <regex.h>

#define N_MATCHES  30

//   output from gcc --version: gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3
//   compile command used:  gcc -o tstc_regex tstc_regex.c

const char *text = "148  apples    5 oranges";
  const char *rstr = "^[0-9]+\\s+\\w+\\s+[0-9]+\\s+\\w+$";    // finds match
//const char *rstr = "^\\d+\\s+\\w+\\s+\\d+\\s+\\w+$";        // does not find match

int main(int argc, char**argv)
{
    regex_t   rgx;
    regmatch_t   matches[N_MATCHES];
    int status;
    status = regcomp(&rgx, rstr, REG_EXTENDED | REG_NEWLINE);
    if (status != 0) {
        fprintf(stdout, "regcomp error: %d\n", status);
        return 1;
    }
    status = regexec(&rgx, text, N_MATCHES, matches, 0);
    if (status == REG_NOMATCH) {
        fprintf(stdout, "regexec result: REG_NOMATCH (%d)\n", status);
    }
    else if (status != 0) {
        fprintf(stdout, "regexec error: %d\n", status);
        return 1;
    }
    else {
        fprintf(stdout, "regexec match found: %d\n", status);
    }
    regfree(&rgx);   // release the memory allocated by regcomp
    return 0;
}

Upvotes: 8

Views: 3706

Answers (4)

Alan Moore

Reputation: 75242

The regex flavor you're using is GNU ERE, which is similar to POSIX ERE, but with a few extra features. Among these are support for the character class shorthands \s, \S, \w and \W, but not \d and \D. You can find more info here.

Upvotes: 9

l'L'l

Reputation: 47264

In a strictly POSIX environment, either pattern will likely fail to match, because \s and \w are extensions as well. To make the pattern truly POSIX compatible, use bracket expressions throughout:

const char *rstr = "^[[:digit:]]+[[:space:]]+[[:alpha:]]+[[:space:]]+[[:digit:]]+[[:space:]]+[[:alpha:]]+$";

POSIX Character_classes
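
As a minimal sketch of how the bracket-expression pattern could be checked, the helper below compiles an ERE and tests it against a string. The name ere_matches is made up for this illustration and is not part of regex.h; only regcomp, regexec, regerror and regfree are library calls. Note that [[:alpha:]] accepts letters only, whereas GNU's \w also accepts digits and underscore, so the two patterns are not exactly equivalent.

```c
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper (not part of regex.h): returns 1 if text matches the
 * POSIX extended regular expression in pattern, 0 if it does not (any
 * regexec failure is reported as 0), and -1 if the pattern fails to compile. */
int ere_matches(const char *pattern, const char *text)
{
    regex_t rgx;
    int status;

    /* REG_NOSUB: only a yes/no answer is needed, no submatch offsets. */
    status = regcomp(&rgx, pattern, REG_EXTENDED | REG_NOSUB);
    if (status != 0) {
        char msg[128];
        regerror(status, &rgx, msg, sizeof msg);   /* readable error text */
        fprintf(stderr, "regcomp error: %s\n", msg);
        return -1;
    }

    status = regexec(&rgx, text, 0, NULL, 0);
    regfree(&rgx);
    return status == 0 ? 1 : 0;
}
```

Calling ere_matches(rstr, text) with the pattern above and the text from the question should report a match, since the pattern relies only on POSIX character classes.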

Upvotes: 5

lostbard

Reputation: 5220

\d is a Perl and Vim character class; the C regex library does not recognize it.

Use instead:

 const char *rstr = "^[[:digit:]]+\\s+\\w+\\s+[[:digit:]]+\\s+\\w+$"; 
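
If many existing patterns use \d, one option is to rewrite them before calling regcomp. The sketch below is only an illustration: expand_digit_shorthand is a hypothetical helper, not a library function, and it assumes that \d and \D never appear inside a bracket expression such as [\d.], which it would rewrite incorrectly.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a malloc'd copy of pattern in which every \d
 * becomes [[:digit:]] and every \D becomes [^[:digit:]]. All other escape
 * pairs are copied unchanged. Returns NULL if allocation fails; the caller
 * frees the result. Assumption: no \d or \D inside a bracket expression. */
char *expand_digit_shorthand(const char *pattern)
{
    static const char digit[] = "[[:digit:]]";
    static const char nondigit[] = "[^[:digit:]]";
    const char *src = pattern;
    /* Worst case: a 2-character escape grows to 12 characters (6x). */
    char *out = malloc(strlen(pattern) * 6 + 1);
    char *dst = out;

    if (out == NULL)
        return NULL;

    while (*src != '\0') {
        if (src[0] == '\\' && src[1] == 'd') {
            memcpy(dst, digit, sizeof digit - 1);
            dst += sizeof digit - 1;
            src += 2;
        } else if (src[0] == '\\' && src[1] == 'D') {
            memcpy(dst, nondigit, sizeof nondigit - 1);
            dst += sizeof nondigit - 1;
            src += 2;
        } else if (src[0] == '\\' && src[1] != '\0') {
            /* Copy any other escape as a pair, so that an escaped
             * backslash followed by 'd' is not mistaken for \d. */
            *dst++ = *src++;
            *dst++ = *src++;
        } else {
            *dst++ = *src++;
        }
    }
    *dst = '\0';
    return out;
}
```

Applied to the non-matching pattern from the question, this yields the same string as the rstr shown above, which can then be passed to regcomp and released with free afterwards.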

Upvotes: 1

Chris Dodd

Reputation: 126418

According to the POSIX regular expression spec:

An ordinary character is any character in the supported character set, except for the ERE special characters listed in ERE Special Characters. The interpretation of an ordinary character preceded by a backslash ( '\' ) is undefined.

So the only characters that can legally follow a \ are:

\^    \.    \[    \$    \(    \)    \|
\*    \+    \?    \{    \\

all of which match the escaped character literally. Trying to use any of the PCRE extensions, such as \d, may not work.
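
The list of escapable characters is also what you need when a literal string has to be embedded in an ERE. As a rough sketch, the hypothetical helper ere_escape below (the name is invented for this example) puts a backslash in front of exactly the characters listed above and leaves everything else alone, so the result never contains an escape whose meaning is undefined.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a malloc'd copy of literal in which each ERE
 * special character is preceded by a backslash, so the result matches the
 * original text literally. Returns NULL if allocation fails; the caller
 * frees the result. */
char *ere_escape(const char *literal)
{
    /* The characters that may legally follow a backslash in an ERE. */
    static const char special[] = "^.[$()|*+?{\\";
    const char *src;
    /* Worst case: every character needs a backslash (2x). */
    char *out = malloc(strlen(literal) * 2 + 1);
    char *dst = out;

    if (out == NULL)
        return NULL;

    for (src = literal; *src != '\0'; src++) {
        if (strchr(special, *src) != NULL)
            *dst++ = '\\';
        *dst++ = *src;
    }
    *dst = '\0';
    return out;
}
```

For example, ere_escape("3.5+(x)") produces the pattern text 3\.5\+\(x\), which can be concatenated into a larger expression before it is handed to regcomp.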

Upvotes: 1
