villaa

Reputation: 1239

regex.h matching differences between OSX and Linux

I need to match the following line with multiple capturing groups:

0.625846        29Si    29      [4934.39        0]      [0.84   100000000000000.0]

I use the regex:

^(0+\.[0-9]?e?[+-]?[0-9]+)\s+([0-9]+\.?[0-9]*|[0-9][0-9]?[0-9]?[A-Z][a-z]?)\s+([0-9][0-9]?[0-9]?)\s+(\[.*\])\s+(\[.*\])$

See this link for a regex101 workspace. However, I find that when I try the matching with regex.h, it behaves differently on OSX and Linux, specifically:

Fails on: OSX: 10.14.6 LLVM: 10.0.1 (clang-1001.0.46.4)

Works on: linux: Ubuntu 18.04 g++: 7.5.0

I worked up a brief program that reproduces the problem, compiled with g++ regex.cpp -o regex:

#include <cstdio>
#include <iostream>
#include <string>

//regex
#include <regex.h>

using namespace std;

int main(int argc, char** argv) {


  //define a buffer for keeping results of regex matching 
  char       buffer[100];

  //regex object to use
  regex_t regex;

  //*****regex match and input file line*******
  string iline = "0.625846        29Si    29      [4934.39        0]      [0.84   100000000000000.0]";
  string matchfile="^(0+\\.[0-9]?e?[+-]?[0-9]+)\\s+([0-9]+\\.?[0-9]*|[0-9][0-9]?[0-9]?[A-Z][a-z]?)\\s+([0-9][0-9]?[0-9]?)\\s+(\\[.*\\])\\s+(\\[.*\\])$";


  //compile the regex 
  int reti = regcomp(&regex,matchfile.c_str(),REG_EXTENDED);

  regerror(reti, &regex, buffer, 100);

  if(reti==0)
    printf("regex compile success!\n");
  else
    printf("regcomp() failed with '%s'\n", buffer);


  //match the input line
  regmatch_t input_matchptr[6];
  reti = regexec(&regex,iline.c_str(),6,input_matchptr,0);

  regerror(reti, &regex, buffer, 100);

  if(reti==0)
    printf("regex match success!\n");
  else
    printf("regexec() failed with '%s'\n", buffer);

  //******************************************

  //release the compiled regex
  regfree(&regex);

  return 0;
}

I have also modified my regex to comply with POSIX (I think?), which I understand LLVM requires, by removing the previous use of the +? and *? operators as per this post, but I may have missed something that still makes it incompatible with POSIX. The regex now seems to compile correctly, which makes me think it is valid, but I still don't understand why no match is obtained.

How can I modify my regex to correctly match?

Upvotes: 1

Views: 183

Answers (1)

Wiktor Stribiżew

Reputation: 627302

To answer the immediate question, you need to use

string matchfile="^(0+\\.[0-9]?e?[+-]?[0-9]+)[[:space:]]+([0-9]+\\.?[0-9]*|[0-9][0-9]?[0-9]?[A-Z][a-z]?)[[:space:]]+([0-9][0-9]?[0-9]?)[[:space:]]+(\\[.*\\])[[:space:]]+(\\[.*\\])$";

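That is, instead of the Perl-like \s, use the POSIX character class [:space:] inside a bracket expression, i.e. [[:space:]]. \s is not part of POSIX extended regular expressions; glibc accepts it as a GNU extension, which is why the original pattern worked on Linux.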

You mention that you tried [:space:] outside of a bracket expression, and it did not work - that is expected. As per Character Classes,

[:digit:] is a POSIX character class, used inside a bracket expression like [x-z[:digit:]].

This means that POSIX character classes are only parsed as such when used inside bracket expressions.
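To check this on a given platform, a minimal sketch along the following lines may help. The helper names ere_matches and ere_group are hypothetical (they are not part of regex.h or of the original code); only regcomp, regexec and regfree are actual regex.h calls. The sketch assumes a POSIX-conforming regex.h and treats a pattern that fails to compile as a non-match.

```cpp
#include <regex.h>

#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: true only if `pattern` compiles as a POSIX extended
// regex (REG_EXTENDED) and matches somewhere in `text`.
bool ere_matches(const std::string& pattern, const std::string& text) {
  regex_t re;
  if (regcomp(&re, pattern.c_str(), REG_EXTENDED | REG_NOSUB) != 0)
    return false;  // pattern rejected by this implementation
  int rc = regexec(&re, text.c_str(), 0, nullptr, 0);
  regfree(&re);
  return rc == 0;
}

// Hypothetical helper: returns the text captured by subexpression `group`
// (0 is the whole match), or an empty string if there is no match.
std::string ere_group(const std::string& pattern, const std::string& text,
                      std::size_t group) {
  regex_t re;
  if (regcomp(&re, pattern.c_str(), REG_EXTENDED) != 0)
    return "";
  std::string out;
  if (group <= re.re_nsub) {
    // One slot for the whole match plus one per capturing group.
    std::vector<regmatch_t> m(re.re_nsub + 1);
    if (regexec(&re, text.c_str(), m.size(), m.data(), 0) == 0 &&
        m[group].rm_so != -1) {
      out = text.substr(m[group].rm_so, m[group].rm_eo - m[group].rm_so);
    }
  }
  regfree(&re);
  return out;
}
```

With these helpers, ere_matches("a[[:space:]]+b", "a \t b") should succeed, while the same call with "a[:space:]+b" should not: outside an enclosing bracket expression, [:space:] is typically read as an ordinary bracket expression listing the characters ':', 's', 'p', 'a', 'c' and 'e'. Calling ere_group with the corrected pattern above, the sample line, and group 2 should give the isotope field, 29Si.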

Upvotes: 1
