Mike Scotty

Reputation: 10782

Get second string parameter of function call from c sources with regex

I am trying to parse the SQLite sources for error messages and my current approach has most cases covered, I think.

My regex:

(?:sqlite3ErrorMsg|sqlite3MPrintf|sqlite3VdbeError)\([^;\"]+\"([^)]+)\"(?:,|\)|:)

Source snippet (not valid C, only for demonstration):

    sqlite3ErrorMsg(pParse, variable);       
    sqlite3ErrorMsg(pParse, "row value misused");
    ){
      sqlite3ErrorMsg(pParse, "no \"such\" function: %.*s", nId, zId);
      pNC->nErr++;
    }else if( wrong_num_args ){
      sqlite3ErrorMsg(pParse,"wrong number of arguments to function %.*s()",
           nId, zId);
      pNC->nErr++;
    }
        if( pExpr->iTable<0 ){
          sqlite3ErrorMsg(pParse,
            "second argument to likelihood must be a "
            "constant between 0.0 and 1.0");
          pNC->nErr++;
        }
    }else if( wrong_num_args ){
      sqlite3ErrorMsg(pParse,"factory must return a cursor, not \\w+", 
           nId);
      pNC->nErr++;

This successfully outputs the following capture groups:

row value misused
no \"such\" function: %.*s
second argument to likelihood must be a "
                "constant between 0.0 and 1.0
factory must return a cursor, not \\w+
          

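However, it misses "wrong number of arguments to function %.*s()" because of the (): the capture group ([^)]+) cannot match a ) inside the string, so it never reaches the closing quote.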
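A minimal check of that diagnosis, assuming Python's built-in re module and two of the calls from the snippet above (the variable names are only illustrative). The pattern is the one from the question, unchanged:

```python
import re

# The pattern from the question, unchanged.
pattern = re.compile(
    r'(?:sqlite3ErrorMsg|sqlite3MPrintf|sqlite3VdbeError)'
    r'\([^;\"]+\"([^)]+)\"(?:,|\)|:)'
)

plain = 'sqlite3ErrorMsg(pParse, "row value misused");'
with_parens = ('sqlite3ErrorMsg(pParse,'
               '"wrong number of arguments to function %.*s()", nId, zId);')

plain_match = pattern.search(plain)
parens_match = pattern.search(with_parens)

# The first call matches; the capture group is the message text.
print(plain_match.group(1))
# The second call does not match at all: [^)]+ stops at the ")" inside
# the string, and no closing quote precedes that position.
print(parens_match)
```

The second search returns None, which is consistent with the missing entry in the list of capture groups.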

Regex101 example

I have also tried to capture from " to " with a negative look-behind to allow escaped \" (so as not to cut off no \"such\" function: %.*s), but I could not get it to work: my regex-fu is not that strong, and the multiline strings complicate things further.

I've also tried to combine the answers from Regex for quoted string with escaping quotes with my regex, but that did not work for me, either.

The general idea is:

There's a function call with one of the three mentioned function names (sqlite3ErrorMsg|sqlite3MPrintf|sqlite3VdbeError), followed by a non-string parameter that I'm not interested in, followed by at least one parameter that may be either a variable (don't want that) or a string (that's what I'm looking for!), followed by an optional arbitrary number of parameters.

The string that I want may be a multiline string and may also contain escaped quotes, parentheses and whatever else is allowed in a C string.

I'm using Python 3.7

Upvotes: 2

Views: 125

Answers (1)

Wiktor Stribiżew

Reputation: 627082

You may consider the following pattern:

(?:sqlite3ErrorMsg|sqlite3MPrintf|sqlite3VdbeError)\s*\(\s*\w+,((?:\s*"[^"\\]*(?:\\.[^"\\]*)*")+)

See the regex demo. You will need to remove the delimiting double quotes manually from each line in a match.

Details:

  • (?:sqlite3ErrorMsg|sqlite3MPrintf|sqlite3VdbeError) - one of the three substrings
  • \s*\(\s* - a ( char enclosed with zero or more whitespaces
  • \w+ - one or more word chars
  • , - a comma
  • ((?:\s*"[^"\\]*(?:\\.[^"\\]*)*")+) - Group 1: one or more repetitions of
    • \s* - zero or more whitespace
    • " - a "
    • [^"\\]* - zero or more chars other than \ and "
    • (?:\\.[^"\\]*)* - zero or more repetitions of a \ followed by any char other than a line break (an escape sequence), and then zero or more chars other than " and \
    • " - a " char.
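The string-literal part of the pattern can be tried on its own. The sketch below is only an illustration: it assumes Python's re.VERBOSE flag to lay the same sub-pattern out with comments, and the name STRING_LITERAL and the sample strings are made up for the demonstration.

```python
import re

# One C string literal, written with re.VERBOSE so each part can be annotated.
STRING_LITERAL = re.compile(r'''
    "               # opening quote
    [^"\\]*         # ordinary characters: anything except a quote or a backslash
    (?:
        \\.         # an escape sequence: a backslash plus one character
        [^"\\]*     # more ordinary characters
    )*
    "               # closing quote
''', re.VERBOSE)

samples = [
    r'"row value misused"',
    r'"no \"such\" function: %.*s"',
    r'"wrong number of arguments to function %.*s()"',
]
for text in samples:
    # Each sample is a complete literal, so the whole string should match.
    print(text, "->", STRING_LITERAL.fullmatch(text) is not None)

# The match ends at the first unescaped quote, not at an escaped one.
first = STRING_LITERAL.match(r'"a \" b", "c"')
print(first.group(0))
```

Because escaped characters are consumed in pairs, an escaped quote never terminates the literal, and parentheses inside the string need no special handling.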

Sample Python code:

import re
file = "sqlite3ErrorMsg(pParse, variable);       \n    sqlite3ErrorMsg(pParse, \"row value misused\");\n    ){\n      sqlite3ErrorMsg(pParse, \"no \\\"such\\\" function: %.*s\", nId, zId);\n      pNC->nErr++;\n    }else if( wrong_num_args ){\n      sqlite3ErrorMsg(pParse,\"wrong number of arguments to function %.*s()\",\n           nId, zId);\n      pNC->nErr++;\n    }\n        if( pExpr->iTable<0 ){\n          sqlite3ErrorMsg(pParse,\n            \"second argument to likelihood must be a \"\n            \"constant between 0.0 and 1.0\");\n          pNC->nErr++;\n        }\n    }else if( wrong_num_args ){\n      sqlite3ErrorMsg(pParse,\"factory must return a cursor, not \\\\w+\", \n           nId);\n      pNC->nErr++;"
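# Group 1 captures the run of adjacent string literals, quotes and line breaks included.
rx = r'(?:sqlite3ErrorMsg|sqlite3MPrintf|sqlite3VdbeError)\s*\(\s*\w+,((?:\s*"[^"\\]*(?:\\.[^"\\]*)*")+)'
# Strip spaces and the delimiting quotes from each line of a match, then join the lines with a space.
matches = [" ".join(map(lambda x: x.strip(' "'), m.strip().splitlines())) for m in re.findall(rx, file)]
print(matches)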

Output:

['row value misused', 'no \\"such\\" function: %.*s', 'wrong number of arguments to function %.*s()', 'second argument to likelihood must be a constant between 0.0 and 1.0', 'factory must return a cursor, not \\\\w+']
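Stripping each line and joining with a space gives the expected text here because the first half of the multiline message ends in a space. A C compiler concatenates adjacent literals without inserting anything, so a possible variation (a sketch only, not part of the approach above) is to pull out each literal's contents with a second pattern and join them with an empty string. The names CALL, LITERAL and extract_messages are hypothetical, and the sample reuses calls from the question.

```python
import re

# Same call pattern as above: group 1 is the run of adjacent string literals.
CALL = re.compile(
    r'(?:sqlite3ErrorMsg|sqlite3MPrintf|sqlite3VdbeError)\s*\(\s*\w+,'
    r'((?:\s*"[^"\\]*(?:\\.[^"\\]*)*")+)'
)
# A single literal; the group captures its contents without the quotes.
LITERAL = re.compile(r'"([^"\\]*(?:\\.[^"\\]*)*)"')


def extract_messages(source):
    """Return one message per matching call, adjacent literals concatenated."""
    messages = []
    for literal_run in CALL.findall(source):
        pieces = LITERAL.findall(literal_run)
        messages.append("".join(pieces))  # no separator, as in C
    return messages


sample = r'''
sqlite3ErrorMsg(pParse, variable);
sqlite3ErrorMsg(pParse,
  "second argument to likelihood must be a "
  "constant between 0.0 and 1.0");
sqlite3ErrorMsg(pParse, "no \"such\" function: %.*s", nId, zId);
'''

messages = extract_messages(sample)
for message in messages:
    print(message)
```

This also avoids strip(' "') removing characters that belong to the message, for example a trailing space or an escaped quote at the very end of a literal.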

Upvotes: 1
