Reputation: 919

Find strings in source code using regex in Perl

I am studying regular expressions in Perl.

I want to write a script that accepts a C source code file and finds strings.

This is my code:

use Data::Dumper;

my $file1 = shift @ARGV;   # first argument is the C source file
open my $fh1, '<', $file1 or die "Cannot open $file1: $!";
my @strings;
while (<$fh1>)
{
  my @words = split(/\s/, $_);
  my $newMsg = join '', @words;
  push @strings, ($newMsg =~ m/"(.*\\*.*\\*.*\\*.*)"/) if ($newMsg =~ /".*\\*.*\\*.*\\*.*"/);
  print Dumper(\@strings);
}
foreach (@strings)
{
  print "strings: $_\n";
}

But I have a problem matching multi-line strings like this:

const char *text2 =
"Here, on the other hand, I've gone crazy\
and really let the literal span several lines\
without bothering with quoting each line's\
content. This works, but you can't indent";

What should I do?

Upvotes: 1

Answers (3)

amon

Reputation: 57590

Here is a simple way of extracting all strings in a source file. There is an important decision we can make: Do we preprocess the code? If not, we may miss some strings if they are generated via macros. We would also have to treat the # as a comment character.

As this is a quick-and-dirty solution, syntactic correctness of the C code is not an issue. We will however honour comments.

Now if the source was pre-processed (with gcc -E source.c), then multiline strings are already folded into one line! Also, comments are already removed. Sweet. The only comments that remain mention line numbers and source files for debugging purposes. Basically all that we have to do is

$ gcc -E source.c | perl -nE'
  next if /^#/;  # skip line directives etc.
  say $1 while /(" (?:[^"\\]+ | \\.)* ")/xg;
'

Output (with the test file from my other answer as input):

""
"__isoc99_fscanf"
""
"__isoc99_scanf"
""
"__isoc99_sscanf"
""
"__isoc99_vfscanf"
""
"__isoc99_vscanf"
""
"__isoc99_vsscanf"
"Here, on the other hand, I've gone crazyand really let the literal span several lineswithout bothering with quoting each line'scontent. This works, but you can't indent"
"Hello %s:\n%s\n"
"World"

So yes, there is a lot of garbage here (they seem to come from __asm__ blocks), but this works astonishingly well.

Note the regex I used: /(" (?:[^"\\]+ | \\.)* ")/x. The pattern inside the capture can be explained as

"         # a literal '"'
(?:       # the begin of a non-capturing group
  [^"\\]+ # a character class that matches anything but '"' or '\', repeated once or more
|
  \\.     # an escape sequence like '\n', '\"', '\\' ...
)*        # zero or more times
"         # closing '"'

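To see the pattern at work without a compiler, here is a minimal sketch that applies it to a single line. The helper name strings_in_line and the sample line are made up for illustration; only the regex comes from the answer.

```perl
use strict; use warnings;

# Same pattern as in the answer, with a capture group around the literal.
my $string_re = qr/(" (?:[^"\\]+ | \\.)* ")/x;

# Return every double-quoted literal found in one line of C code.
# (strings_in_line is a hypothetical helper, not part of the answer.)
sub strings_in_line {
    my ($line) = @_;
    my @found;
    push @found, $1 while $line =~ /$string_re/g;
    return @found;
}

# Hypothetical sample line with escaped quotes and an escaped backslash.
my $line = 'printf("say \"hi\"\n", "C:\\\\dir");';
print "$_\n" for strings_in_line($line);
```

Because the \\. branch consumes a backslash together with the character after it, an escaped quote inside a literal does not end the match early.
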
What are the limitations of this solution?

We need a preprocessor
- This code was tested with gcc
- clang also supports the -E option, but I have no idea how the output is formatted.
Character literals are a failure mode, e.g. myfunc('"', a_variable, '"') would be extracted as "', a_variable, '".
We also extract strings from other source files. (false positives)

Oh wait, we can fix the last bit by parsing the source file comments which the preprocessor inserted. They look like

# 29 "/usr/include/stdio.h" 2 3 4

So if we remember the current filename, and compare it to the filename we want, we can skip unwanted strings. This time, I'll write it as a full script instead of a one-liner.

use strict; use warnings;
use autodie;  # automatic error handling
use feature 'say';

my $source = shift @ARGV;
my $string_re = qr/" (?:[^"\\]+ | \\.)* "/x;

# open a pipe from the preprocessor
open my $preprocessed, "-|", "gcc", "-E", $source;

my $file;
while (<$preprocessed>) {
  $file = $1 if /^\# \s+ \d+ \s+ ($string_re)/x;
  next if /^#/;
  next if $file ne qq("$source");
  say $1 while /($string_re)/xg;
}

Usage: $ perl extract-strings.pl source.c

This now produces the output:

"Here, on the other hand, I've gone crazyand really let the literal span several lineswithout bothering with quoting each line'scontent. This works, but you can't indent"
"Hello %s:\n%s\n"
"World"

If you cannot use the convenient preprocessor to fold multiline strings and remove comments, this gets a lot uglier, because we have to account for all of that ourselves. Basically, you want to slurp in the whole file at once, not iterate it line by line. Then, you skip over any comments. Do not forget to ignore preprocessor directives as well. After that, we can extract the strings as usual. Basically, you have to rewrite the grammar

Start → Comment Start
Start → String Start
Start → Whatever Start
Start → End

to a regex. As the above is a regular language, this isn't too hard.

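One way the grammar above could be turned into a regex is sketched below. This is an illustration under assumptions, not a complete C lexer: the names $token_re and extract_strings and the sample input are invented here, preprocessor directives are skipped whole (so strings inside #define are lost), and trigraphs or comments inside directives are not considered. Character literals get their own alternative so that '"' does not open a string.

```perl
use strict; use warnings;

# One alternative per kind of token; the first alternative that matches wins.
my $token_re = qr{
    (?<directive> ^ [ \t]* \# (?: [^\n\\]+ | \\. )* )  # preprocessor line, with continuations
  | (?<comment>   /\* .*? \*/ | // [^\n]* )            # block or line comment
  | (?<string>    " (?: [^"\\]+ | \\. )* " )           # string literal
  | (?<char>      ' (?: [^'\\]+ | \\. )* ' )           # character literal
  | (?<other>     . )                                  # whatever else, one character at a time
}xms;

# Takes the whole source as one scalar and returns the string literals.
sub extract_strings {
    my ($src) = @_;
    my @strings;
    while ($src =~ /$token_re/g) {
        next unless defined $+{string};
        (my $literal = $+{string}) =~ s/\\\n//g;   # fold backslash-newline continuations
        push @strings, $literal;
    }
    return @strings;
}

# Hypothetical sample input.
my $sample = <<'END_C';
/* A "comment" */
#include "local.h"
const char *text = "two\
lines";
char q = '"';  // not a "string"
puts("done");
END_C
print "$_\n" for extract_strings($sample);
```

For a real file, the contents would first be slurped into one scalar, for example with my $src = do { local $/; <$fh> };.
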
Upvotes: 1

amon

Reputation: 57590

Here is a fun solution. It uses MarpaX::Languages::C::AST, an experimental C parser. We can use the c2ast.pl program that ships with the module to convert a C source file to an abstract syntax tree, which we dump to some file (using Data::Dumper). We can then extract all strings with a bit of magic.

Unfortunately, the AST objects have no methods, but as they are autogenerated, we know how they look on the inside.

They are blessed arrayrefs.
- Some contain a single unblessed arrayref of items,
- Others contain zero or more items (lexemes or objects)
“Lexemes” are an arrayref with two fields of location information, and the string contents at index 2.

This information can be extracted from the grammar.

The code:

use strict; use warnings;
use Scalar::Util 'blessed';
use feature 'say';

our $VAR1;
require "test.dump"; # populates $VAR1

my @strings = map extract_value($_), find_strings($$VAR1);
say for @strings;

sub find_strings {
  my $ast = shift;
  return $ast if $ast->isa("C::AST::string");
  return map find_strings($_), map flatten($_), @$ast;
}

sub flatten {
  my $thing = shift;
  return $thing if blessed($thing);
  return map flatten($_), @$thing if ref($thing) eq "ARRAY";
  return (); # we are not interested in other references, or unblessed data
}

sub extract_value {
  my $string = shift;
  return unless blessed($string->[0]);
  return unless $string->[0]->isa("C::AST::stringLiteral");
  return $string->[0][0][2];
}

A rewrite of find_strings from recursion to iteration:

sub find_strings {
  my @unvisited = @_;
  my @found;
  while (my $ast = shift @unvisited) {
    if ($ast->isa("C::AST::string")) {
      push @found, $ast;
    } else {
      push @unvisited, map flatten($_), @$ast;
    }
  }
  return @found;
}

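To try the traversal without installing the module, the sketch below runs the same subs on a hand-built toy tree that follows the shape described above. The tree itself, the location numbers in the lexemes, and the class names C::AST::typeSpecifier and C::AST::translationUnit are assumptions made up for this illustration; only C::AST::string and C::AST::stringLiteral appear in the answer.

```perl
use strict; use warnings;
use Scalar::Util 'blessed';

# Copies of the subs from the answer, so that this sketch runs on its own.
sub flatten {
    my $thing = shift;
    return $thing if blessed($thing);
    return map flatten($_), @$thing if ref($thing) eq 'ARRAY';
    return ();   # plain scalars and other references are dropped
}

sub find_strings {
    my @unvisited = @_;
    my @found;
    while (my $ast = shift @unvisited) {
        if ($ast->isa('C::AST::string')) {
            push @found, $ast;
        } else {
            push @unvisited, map flatten($_), @$ast;
        }
    }
    return @found;
}

sub extract_value {
    my $string = shift;
    return unless blessed($string->[0]);
    return unless $string->[0]->isa('C::AST::stringLiteral');
    return $string->[0][0][2];
}

# Hand-built toy tree (hypothetical). A lexeme is [location, location, text].
my $literal = bless [ [ 10, 7, '"World"' ] ], 'C::AST::stringLiteral';
my $string  = bless [ $literal ], 'C::AST::string';
my $type    = bless [ [ 0, 3, 'int' ] ], 'C::AST::typeSpecifier';
my $tree    = bless [ [ $type, $string ] ], 'C::AST::translationUnit';

print "$_\n" for map extract_value($_), find_strings($tree);
```

The lexeme inside the typeSpecifier node is discarded by flatten because it holds only plain scalars, so only the node blessed into C::AST::string is collected.
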
The test C code:

/* A "comment" */
#include <stdio.h>

static const char *text2 =
"Here, on the other hand, I've gone crazy\
and really let the literal span several lines\
without bothering with quoting each line's\
content. This works, but you can't indent"; 

int main() {
        printf("Hello %s:\n%s\n", "World", text2);
        return 0;
}

I ran the commands

$ perl $(which c2ast.pl) test.c -dump >test.dump;
$ perl find-strings.pl

Which produced the output

"Here, on the other hand, I've gone crazyand really let the literal span several lineswithout bothering with quoting each line'scontent. This works, but you can't indent"
"World"
"Hello %s\n"
"" 
"" 
"" 
"" 
"" 
""

Notice how there are some empty strings that are not from our source code; they come from somewhere in the included files. Filtering those out would probably not be impossible, but is a bit impractical.

Upvotes: 4

PP.

Reputation: 10864

It appears you're trying to use the following regular expression to capture multiple lines in a string:

my $your_regexp = qr{
    (
        .*  # anything
        \\* # any number of backslashes
        .*  # anything
        \\* # any number of backslashes
        .*  # anything
        \\* # any number of backslashes
        .*  # anything
    )
}x;

But it looks more like a grasp of desperation than a deliberately thought-out plan.

So you've got two problems:

find everything between double quotes (")
handle the situation where there might be multiple lines between those quotes

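Regular expressions can match across multiple lines, provided the whole text is in one string rather than being read line by line. The /s modifier makes . match newlines as well; a negated character class such as [^"] matches newlines even without it. So try: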

my $your_new_regexp = qr{
    \"       # opening quote mark
    ([^\"]+) # anything that's not a quote mark, capture
    \"       # closing quote mark
}xs;

You might actually have a 3rd problem:

remove trailing backslash/newline pairs from strings

You could handle this by doing a search-replace:

foreach ( @strings ) {
    $_ =~ s/\\\n//g;
}

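Put together, the steps above might look like the following sketch. The sub name extract_strings and the sample input are made up for illustration, and the pattern keeps the limitation of the one shown above: it stops at the first double quote, so escaped quotes (\") and empty strings are not handled.

```perl
use strict; use warnings;

# The pattern from above: everything between two double quotes.
my $string_re = qr{
    \"       # opening quote mark
    ([^\"]+) # anything that's not a quote mark, capture
    \"       # closing quote mark
}xs;

# Takes the whole file contents as one scalar (hypothetical helper).
sub extract_strings {
    my ($source) = @_;
    my @strings = $source =~ /$string_re/g;   # list context returns the captures
    s/\\\n//g for @strings;                   # remove backslash/newline pairs
    return @strings;
}

# Hypothetical sample input, based on the literal in the question.
my $code = <<'END_C';
const char *text2 =
"Here, on the other hand, I've gone crazy\
and really let the literal span several lines";
END_C
print "strings: $_\n" for extract_strings($code);
```

To run this on a real file, read it in one go instead of line by line, for example with my $source = do { local $/; <$fh> };.
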
Upvotes: 3
