Digital Ninja

Reputation: 3741

Find all distinct regex results, not all in one line

I want to find all different parameters passed to the __() function in my project. So far the best grep call I've constructed is this one:

find . -name "*.php" | xargs grep "__('.*')" -sioh

It successfully finds all calls to the __() function, but it has the following problems:

  1. It prints the entire __() function call instead of only the parameter
  2. It prints both function calls in the same line when there are multiple calls to the function in the same line in the original file

What I want is a list of all distinct parameters passed to the function, so I would like each parameter to be in its own line (no __( at the beginning and no ) at the end).

For an example line that looks like this:

/* Some code */ __('foo'); /* Some more code */ __('bar'); /* Even more code */

My command returns the following result:

__('foo'); /* Some more code */ __('bar')

What I would like to get is this (each parameter on its own line, without quotes):

foo
bar

Edited:

As it turns out, the first argument is not always a single quoted string. Sometimes it's a variable (starting with a $ sign as it's PHP in question, and optionally having array indexes, e.g. $a['b']).

And there are two more optional boolean arguments. But it's only the first argument I actually care about getting in the result, the other two are not important.

Upvotes: 3

Views: 187

Answers (3)

mklement0

Reputation: 439727

This answer assumes the following, in line with the OP's later clarification:
- __() calls in the input data have 1-3 arguments, not necessarily single-quoted.
- Only the 1st argument should be extracted.
- The 1st argument itself contains neither , nor ).

Try the following, which should work on most platforms:

find . -name "*.php" -exec grep -sioh "__([^,)]*" {} + | cut -c 4-
  • -exec with + ensures that as few invocations of grep as possible are performed (in most cases, just 1); {} is the placeholder for the matching filenames.
  • As pointed out in Etan Reisner's answer, the grep regex should be less greedy so that multiple calls on the same line are matched separately; furthermore, since it's now clear that only the 1st argument should be extracted, [^,)]* is used to match only up to the next argument or the closing parenthesis. (Note that this can still fail if the 1st argument itself contains a comma or a parenthesis.)
  • The cut command removes the unwanted parts from grep's output (strips the __( prefix).
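The question also asks for distinct values without the surrounding quotes, which the command above does not yet handle. As a minimal sketch, assuming a throwaway directory with a made-up sample.php (both hypothetical, not from the original project), the same pipeline can be extended with sed to strip enclosing single quotes and sort -u to remove duplicates:

```shell
# Hypothetical sample data in a temporary directory (not from the original project).
demo_dir=$(mktemp -d)
cat > "$demo_dir/sample.php" <<'EOF'
<?php /* code */ __('foo'); /* more */ __('bar', true); /* more */ __('foo');
echo __($label, false, true);
EOF

distinct_args=$(
  find "$demo_dir" -name "*.php" -exec grep -sioh "__([^,)]*" {} + |
    cut -c 4- |                   # drop the 3-character "__(" prefix
    sed "s/^'\(.*\)'\$/\1/" |     # strip one pair of enclosing single quotes, if present
    LC_ALL=C sort -u              # deduplicate; the C locale keeps the order predictable
)
printf '%s\n' "$distinct_args"

rm -rf "$demo_dir"
```

Variable arguments such as $label pass through unchanged, because the sed expression only touches values that both start and end with a single quote.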

If your grep implementation supports -R (for recursive search) and --include (to restrict files searched to those matching a glob), you can use

 grep -R --include '*.php' -sioh "__([^,)]*" . | cut -c 4-

If your grep implementation additionally supports -P (PCREs: Perl-compatible regexes), use a modified version of anubhava's answer:

 grep -R --include '*.php' -siohP "__\(\K[^,)]*" .

With -P, the regex can also be made more robust by appending a lookahead assertion ((?=...)) that ensures the matched token is indeed followed by a literal , or ):

 grep -R --include '*.php' -siohP "__\(\K[^,)]*(?=[,)])" .
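To illustrate what the lookahead changes, here is a small sketch that assumes a grep built with -P support (GNU grep) and a made-up input line in which the second call is not closed on the same line:

```shell
# Hypothetical input: the second call's argument list continues on a later line.
sample="echo __('ok'); echo __(\$multi"

# Without the lookahead, the unterminated token is reported too.
without_check=$(printf '%s\n' "$sample" | grep -oP '__\(\K[^,)]*')

# With the lookahead, only tokens followed by , or ) are reported.
with_check=$(printf '%s\n' "$sample" | grep -oP '__\(\K[^,)]*(?=[,)])')

printf 'without lookahead:\n%s\n' "$without_check"
printf 'with lookahead:\n%s\n' "$with_check"
```

Whether dropping such tokens is desirable depends on the code base; calls that span several lines would be silently skipped.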

Finally, note how grep with -P requires \( to match a literal (, whereas the non-P grep commands above use basic regular expressions, where ( is not special and is treated as a literal (there, you'd have to use \( to make it special).

In grep implementations without -P, invoking grep as egrep or using -E activates support for extended regular expressions, which have more features and are closer in syntax to PCREs, but are not as powerful.
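The difference in how ( is escaped can be checked directly. The following sketch assumes a grep that supports -o and runs the same search on a made-up line, once as a basic and once as an extended regular expression:

```shell
# Hypothetical input line with two calls.
line="/* code */ __('foo'); /* more */ __('bar', true);"

# BRE (default): an unescaped ( is a literal character.
bre_out=$(printf '%s\n' "$line" | grep -o "__([^,)]*")

# ERE (-E): ( opens a group, so the literal must be written as \(.
ere_out=$(printf '%s\n' "$line" | grep -oE "__\([^,)]*")

printf 'BRE:\n%s\nERE:\n%s\n' "$bre_out" "$ere_out"
```

Both variants should report the same two matches; only the escaping of the parenthesis differs.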


A note on portability:

  • -P (support for PCREs == Perl-Compatible Regular Expressions) is a GNU grep extension (won't work in BSD grep).
  • -o is an extension found in (at least) GNU grep and BSD grep.
  • -R and --include are extensions found in (at least) GNU grep and BSD grep.

Upvotes: 4

Etan Reisner

Reputation: 81012

This isn't as good as anubhava's answer, but it improves on the original command and works with a grep that has no PCRE support.

Using [^)]* instead of .* in the match will stop each match at the end of its own function call instead of at the end of the last function call on the line.

$ grep -sioh "__('[^)]*')" *.php
__('foo')
__('bar')

Upvotes: 1

anubhava

Reputation: 785781

Use this grep -P (PCRE):

grep -HoP '__\(\K[^)]*' *.php
file.php:'foo'
file.php:'bar'

It matches __\(, and \K then discards everything matched so far, so that it is not part of the reported match. [^)]* then matches the text up to the next ).

Upvotes: 2
