Miloš Đakonović

Reputation: 3871

Grep any whitespace character including newline in single pattern

I'm trying to build the 'perfect' command to list every .php file in a directory or its subdirectories that contains eval code.

Since there are many false positives, I'm after a solution that filters out at least the most obvious of them, so my target is:

the word eval, followed by any whitespace character (including newline) zero or more times, followed by an open bracket character (.

Here are my shots:

find . -type f -exec grep -l "eval\s*(" {} \; | grep ".php"

This works well, but \s* here doesn't match newline characters, so

eval

("some nasty obfuscated code");

stays below the radar.

I've also tried with:

find . -type f -exec grep -l "eval[[:space:]]*(" {} \; | grep ".php"

with the same results.
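A minimal reproduction of the problem, assuming a throwaway directory from mktemp and made-up file names: grep tests each input line on its own, so no pattern can span the newline between eval and the bracket.

```shell
# Hypothetical sample files in a throwaway directory.
dir=$(mktemp -d)
printf 'eval("x");\n' > "$dir/one_line.php"
printf 'eval\n\n("x");\n' > "$dir/split.php"

# grep checks one line at a time, so the call split over lines goes unnoticed.
hits=$(cd "$dir" && grep -l 'eval[[:space:]]*(' *.php)
printf '%s\n' "$hits"    # prints: one_line.php

rm -rf "$dir"
```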

Upvotes: 2

Views: 3757

Answers (2)

hmedia1

Reputation: 6200

Simple Version:

For simplicity's sake, if using awk instead of grep is an option, then for the php files in /tmp/ you could simply run:

awk -v RS="^$" '/eval[[:space:]]*\(/ { print FILENAME }' /tmp/*.php

And that will print the files that match.
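A small sketch of that command on sample data, assuming throwaway files created with mktemp (the file names are hypothetical). In awk implementations that treat RS as a regular expression (gawk, for instance), "^$" never matches inside a non-empty file, so each file is read as a single record and [[:space:]]* can cross line breaks:

```shell
# Hypothetical sample files in a throwaway directory.
dir=$(mktemp -d)
printf '<?php\neval\n\n("x");\n' > "$dir/split.php"
printf '<?php\necho "hello";\n' > "$dir/clean.php"

# Each file is one record, so the newlines between eval and ( are just whitespace.
hits=$(cd "$dir" && awk -v RS='^$' '/eval[[:space:]]*\(/ { print FILENAME }' *.php)
printf '%s\n' "$hits"    # prints: split.php

rm -rf "$dir"
```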

If you need to use the output of find:

find /tmp/ -iname "*.php" -print | while IFS= read -r file ; do awk -v RS="^$" '/eval[[:space:]]*\(/ { print FILENAME }' "$file" ; done

The above is simple and works even with busybox and basic versions of awk.
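As a sketch of an alternative to the read loop, find can hand the file names to awk directly with -exec ... {} +, which also copes with spaces in names. The directory layout below is made up for illustration and is not part of the original setup:

```shell
# Hypothetical tree with a space in both the directory and the file name.
dir=$(mktemp -d)
mkdir -p "$dir/sub dir"
printf '<?php\neval\t\n("x");\n' > "$dir/sub dir/bad file.php"
printf '<?php\necho 1;\n' > "$dir/ok.php"

# find passes all matching names to a single awk run; no shell word splitting is involved.
hits=$(cd "$dir" && find . -iname '*.php' -exec awk -v RS='^$' '/eval[[:space:]]*\(/ { print FILENAME }' {} +)
printf '%s\n' "$hits"    # prints: ./sub dir/bad file.php

rm -rf "$dir"
```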

Alternate (With matches)

This part of the answer may seem absurd to some, but with enough experience of searching for whitespace and doing serialisation in the shell, the number of gotchas becomes evident, and the need for a working solution outweighs the preference for built-in one-liners.

This might also help others who stumble across a similar need but want easy-to-read line previews, perhaps for parsing, or for simplicity:

NOTE 1: This solution works in sh/ash/busybox as well as bash (the external binary xxd would still be needed)

NOTE 2: For BSD grep, substitute -P with -E. Using -E on a GNU grep that supports -P does not seem to yield the same non-greedy matches (the *? and +? quantifiers).

Example Test File

Take this test file (with special characters notated in place), plus 2 other test files that are located in /tmp/ for this example:

eval file

find /tmp/ -iname "*.php" -print \
| while read file ; do hexdump -ve '1/1 " %02X"' "$file" \
| sed -E "s/($)/ 0A/g" \
| grep -P -o "65 76 61 6C( 09| 0A| 0B| 0C| 0D| 20)*? 28 22.+?0A" \
| sed -E -e 's/ //g' \
| sed -E -e 's/(0A)+([^$])/20\2/g' \
| sed -E -e 's/(09|0B|0C|0D|20)+/20/g' \
| xxd -r -p \
| grep -i "eval" && printf "$file matches\n\n" ; done

This returns the matches, from eval to the end of the line where the (" was matched, replacing runs of line breaks and whitespace with a single space for readability:

eval ("some nasty obfuscated code (LF / LINE FEED)");
eval ("some nasty obfuscated code (HT / TAB)");
eval ("some nasty obfuscated code (SP / SPACE)");
eval ("some nasty obfuscated code (FF / FORM FEED)");
eval ("some nasty obfuscated code (CR / CARRIAGE RETURN)");
eval ("some nasty obfuscated code (VT / VERTICAL TAB)");
eval ("some nasty obfuscated code (LF > HT > FF > CR > LF > LF > HT > VT > LF > HT > SP)");
eval ("some nasty obfuscated code (VT / VERTICAL TAB)");
/tmp/eval.php matches

eval ("some nasty obfuscated code (LF / LINE FEED)");
/tmp/eval_no_trailing_line_feed.php matches

eval("\$str = \"$str\";");
/tmp/eval_w3_example.php matches

To get just the matching file names with this method (maybe to allow for a "-v" option, for example), change grep -i on the last line to grep -iq.

Explanation:

find /tmp/ -iname "*.php" -print \ : Find .php files in /tmp/

| while read file ; do hexdump -ve '1/1 " %02X"' "$file" \ : hexdump each resulting file, and output in single space separated bytes (to avoid any matching from the second character of one byte to the first char of another byte)

| sed -E "s/($)/ 0A/g" \ : Append a single 0A (line feed) to the very end of the hex dump, so that a file without a trailing line feed still matches (a missing trailing line feed can sometimes cause issues with text processing)

| grep -P -o "65 76 61 6C( 09| 0A| 0B| 0C| 0D| 20)*? 28 22.+?0A" \ : Return only the matching part (note that grep adds a line break to each match)

  • 6576616C : eval
  • 09 : horizontal TAB
  • 0A : line feed
  • 0B : vertical TAB
  • 0C : form feed
  • 0D : carriage return
  • 20 : plain SPACE
  • 2822 : ("
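The byte values in the list above can be checked with a short sketch. It uses od rather than hexdump, purely as an assumption that od is the more widely installed tool; the sample string is made up and simply contains eval, each whitespace character from the list, and then (":

```shell
# Dump a sample string as upper-case, space-separated hex bytes.
# od stands in for hexdump here; tr/sed only normalise case and spacing.
bytes=$(printf 'eval\t\n\v\f\r ("' \
    | od -An -tx1 \
    | tr 'a-f' 'A-F' \
    | tr -s ' \n' ' ' \
    | sed 's/^ //; s/ $//')

printf '%s\n' "$bytes"    # prints: 65 76 61 6C 09 0A 0B 0C 0D 20 28 22
```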

| sed -E -e 's/ //g' \ : Remove all spaces between bytes (may not have been needed in the end)

| sed -E -e 's/(0A)+([^$])/20\2/g' \ : Look for any repeated occurrences of 0A (line feed), as long as they are not the line feed at the end of the line, and replace them with a single space (20)

| sed -E -e 's/(09|0B|0C|0D|20)+/20/g' \ : Look for any of the white space characters above, and replace them with a space, for readability

| xxd -r -p \ : Revert back from hex

| grep -i "eval" && printf "$file matches\n\n" ; done : Print the match and the file name. The && means that printf runs only if grep exited with status 0 (success), so it won't simply print every file in the loop. As noted before, adding -q to this grep still drives the printf, but suppresses the matching lines.
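The exit-status logic of that last step can be sketched in isolation. The helper name check and the sample inputs are hypothetical; the sketch also passes the file name through a '%s' format, an optional hardening so that a name containing % is not interpreted by printf:

```shell
check() {
    # $1 = label to report, $2 = text to search (both hypothetical inputs).
    # grep -q prints nothing; && runs printf only when grep exits with status 0.
    printf '%s\n' "$2" | grep -qi 'eval' && printf '%s matches\n' "$1"
    return 0
}

matched=$(check '100%_sample.php' 'eval ("x");')
missed=$(check 'clean.php' 'echo 1;')

printf '%s\n' "$matched"    # prints: 100%_sample.php matches
printf '[%s]\n' "$missed"   # prints: []
```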

Upvotes: 0

talz

Reputation: 1200

If I understood you correctly, I believe this line is what you're looking for:

find . -name '*.php' -exec grep -Ezl 'eval\s*\(' {} +

The -z is what you've been missing (see the explanation below). Of course, you can give the find command any other root instead of . and add arguments and conditions according to where you are looking and what you are looking for.

That was it. From here on, explanations:

The find command

It would probably be faster in most cases to first search for files with .php extension, and then search only within these files for your regular expression. The -name '*.php' part gives us this behavior by searching only for files with a file name ending with '.php'.

-exec allows us to execute a command using the output of the find command (file names). We are using it in order to execute grep for all php files.

The {} + syntax at the end of the line builds one long list of file names as arguments for the grep command, instead of executing grep separately for every file.
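The difference between the two -exec terminators can be seen with a small sketch that counts arguments instead of running grep. The three empty files are hypothetical, and sh -c 'echo "$#"' is only a stand-in command that reports how many file names it received:

```shell
dir=$(mktemp -d)
touch "$dir/a.php" "$dir/b.php" "$dir/c.php"

# {} + : a single invocation receives all three names.
batched=$(find "$dir" -name '*.php' -exec sh -c 'echo "$#"' sh {} +)

# {} \; : one invocation per file, each receiving exactly one name.
per_file_runs=$(find "$dir" -name '*.php' -exec sh -c 'echo "$#"' sh {} \; | grep -c '^1$')

printf '%s\n' "$batched"          # prints: 3
printf '%s\n' "$per_file_runs"    # prints: 3 (three separate runs, one file each)

rm -rf "$dir"
```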

The grep command

-E: Interpret PATTERN as an extended regular expression (copied from the grep man page)

-z: Treat the input as a set of lines, each terminated by a zero byte instead of a newline (grep man page). That means that for a normal textual file, the whole file would be treated as one long line. This behavior allows you to use multi-lined regular expressions.

-l: tells grep to only show the filenames for all the files matching the search, and not to show the matching lines.
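A short sketch of the effect of -z, assuming GNU grep (both -z and \s are extensions there) and a made-up sample file: the same pattern misses the call without -z and finds it with -z.

```shell
# Hypothetical sample file with the call split over several lines.
dir=$(mktemp -d)
printf '<?php\neval\n\n("x");\n' > "$dir/split.php"

# Without -z: line-by-line matching, no hit (|| true keeps the exit status 0).
without_z=$(cd "$dir" && grep -El 'eval\s*\(' split.php || true)

# With -z: the file has no zero bytes, so grep sees it as one long "line".
with_z=$(cd "$dir" && grep -Ezl 'eval\s*\(' split.php)

printf '[%s]\n' "$without_z"    # prints: []
printf '%s\n' "$with_z"         # prints: split.php

rm -rf "$dir"
```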

The regular expression:

'eval' just matches the word eval. '\s' matches any whitespace character, and the '*' after it means it can appear zero or more times. '\(' matches a literal opening bracket, which needs escaping in an extended regular expression (that's what the \ is for).
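To see which inputs the pattern accepts, here is a small sketch over four made-up sample lines. It also compares \s, which is a GNU extension, with the POSIX class [[:space:]]; the helper name count_hits is hypothetical:

```shell
count_hits() {
    # $1 = extended regular expression; prints how many sample lines match.
    # 'eval(' and 'eval   (' should match; 'evaluate(' and 'eval x(' should not.
    printf '%s\n' 'eval(' 'eval   (' 'evaluate(' 'eval x(' | grep -Ec "$1"
}

gnu_count=$(count_hits 'eval\s*\(')
posix_count=$(count_hits 'eval[[:space:]]*\(')

printf '%s %s\n' "$gnu_count" "$posix_count"    # prints: 2 2
```

The pattern does not require a word boundary before eval, so a name such as medieval( would also match; that may or may not count as a false positive for this use case.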

have fun!

Upvotes: 1
