pandagrammer

Reputation: 871

Grep N times from pipe using xargs

I have a file named input that contains a list of strings, each of which is either a Wikipedia title or a substring of one. I only want to print the lines that are complete Wikipedia titles, not the substrings.

I have another file named wikititle that contains a list of all Wikipedia titles. So I want to grep for each line of input in wikititle, and print that line if it matches ^{string}$.

I came up with the command below:

cat input | xargs -0 -I{} bash -c 'grep -q -w ^{}$ wikititle && { echo {}; }'

But it gives me this error:

 xargs: command too long

How do I make this happen? Thanks!

Upvotes: 0

Views: 359

Answers (2)

Charles Duffy

Reputation: 295650

The right way to print out lines which are found in both of two files is with comm:

comm -12 <(sort input) <(sort wikititle)

This is vastly more efficient than what you were trying to do: It runs only a single pass, and needs to store very little content in memory at a time (sort can have larger memory requirements, but the GNU implementation supports using disk-backed temporary storage).
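As a rough illustration of the comm approach, here is a minimal sketch on made-up sample data. The titles and the temporary directory are assumptions for the demo only, and the sorted copies are written to ordinary files so the sketch also runs in shells without process substitution. LC_ALL=C is set so that sort and comm agree on ordering.

```shell
# Hypothetical sample data; only the file names mirror the question.
export LC_ALL=C   # make sort and comm use the same collation order
dir=$(mktemp -d)
printf '%s\n' 'Python' 'Pyth' 'Awk' 'Grep (software)' > "$dir/input"
printf '%s\n' 'Awk' 'Grep (software)' 'Python' 'Sed' > "$dir/wikititle"

# comm requires both inputs to be sorted.
sort "$dir/input" > "$dir/input.sorted"
sort "$dir/wikititle" > "$dir/wikititle.sorted"

# -1 drops lines found only in the first file, -2 drops lines found only
# in the second, so only lines present in both files remain.
common=$(comm -12 "$dir/input.sorted" "$dir/wikititle.sorted")
printf '%s\n' "$common"

rm -rf "$dir"
```

With this sample data, the substring Pyth is dropped because it is not a line of wikititle.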


Another much more efficient approach would be the following:

grep -F -x -f input wikititle

...this would run grep only once, using all the (newline-separated) strings given in input, against the contents of wikititle.

Using grep -F avoids treating arguments as regexes, so that even strings like Foo [Bar] will match themselves when fully anchored (which they wouldn't with a grep that treated [Bar] as a character class). Using -x requires full-line matches (thank you, @tripleee!).
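To make the effect of -F concrete, the following sketch runs the same patterns with and without it. The sample titles are invented for the demo; only the grep options come from the answer.

```shell
# Hypothetical sample data chosen to include a title with brackets.
dir=$(mktemp -d)
printf '%s\n' 'Foo [Bar]' 'Foo' 'Pyth' > "$dir/input"
printf '%s\n' 'Python' 'Foo [Bar]' 'Foo B' 'Foo' > "$dir/wikititle"

# Fixed strings, whole-line matches: each pattern matches only itself.
fixed=$(grep -F -x -f "$dir/input" "$dir/wikititle")

# Without -F, "Foo [Bar]" is a regex meaning "Foo " followed by one of
# B, a or r, so it matches "Foo B" and no longer matches "Foo [Bar]".
regex=$(grep -x -f "$dir/input" "$dir/wikititle")

printf 'with -F:\n%s\n' "$fixed"
printf 'without -F:\n%s\n' "$regex"

rm -rf "$dir"
```

In both runs Pyth produces no output, because -x rejects a pattern that matches only part of the line Python.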


...and, if you really wanted to use xargs and a whole bunch of separate grep calls and a shell-level echo for no good reason...

<input xargs bash -c \
  'for line; do grep -q -F -x -e "$line" wikititle && printf "%s\n" "$line"; done' _

Note that this doesn't use -I '{}', an option that makes xargs far less efficient (forcing it to run the command once for every input line) and also introduces potential security bugs when used with bash -c (if a line in your input file contains $(rm -rf ~), you probably don't want to execute it). Instead, it uses a for loop in bash to iterate over the items passed as arguments. Be aware that without -I, xargs splits its input on whitespace and interprets quotes, so titles containing spaces need GNU xargs's -d '\n' (or NUL-delimited input with -0) to be passed through whole.
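The difference between substituting data into the script text and passing it as arguments can be seen with a harmless payload. This is only a sketch: the arithmetic expansion $((1+1)) stands in for a hostile command substitution, and sh is used in place of bash to keep the demo portable.

```shell
# A harmless stand-in for a malicious line such as $(rm -rf ~).
line='$((1+1))'

# Unsafe: -I{} pastes the data into the script text, so the inner shell
# evaluates it as code.
unsafe=$(printf '%s\n' "$line" | xargs -I{} sh -c 'echo {}')

# Safe: the data arrives as a positional argument and is never parsed
# as shell syntax.
safe=$(printf '%s\n' "$line" |
  xargs sh -c 'for arg in "$@"; do printf "%s\n" "$arg"; done' _)

printf 'unsafe: %s\n' "$unsafe"
printf 'safe:   %s\n' "$safe"
```

The first form prints the result of evaluating the expression, while the second prints the line exactly as it appeared in the input.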

Upvotes: 3

Ed Morton

Reputation: 204025

Without sample input and expected output it's a guess but it sounds like all you need is:

awk 'NR==FNR{titles[$0];next} $0 in titles' wikititle input

Remember that shell is an environment from which to manipulate files and processes and invoke tools, NOT a tool to manipulate text. The guys who created shell also created awk for shell to call to manipulate text.
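For readers unfamiliar with the NR==FNR idiom in the awk command above, here is a small sketch on invented sample data. While awk reads the first file, NR (the overall record number) equals FNR (the per-file record number), so every title is stored as an array key; for the second file the condition is false and only lines that are keys of titles are printed.

```shell
# Hypothetical sample data; only the file names mirror the question.
dir=$(mktemp -d)
printf '%s\n' 'Python' 'Pyth' 'Awk' > "$dir/input"
printf '%s\n' 'Awk' 'Python' 'Sed' > "$dir/wikititle"

# First file (wikititle): remember every line as a key, print nothing.
# Second file (input): print lines that were seen in the first file.
matches=$(awk 'NR==FNR { titles[$0]; next } $0 in titles' \
  "$dir/wikititle" "$dir/input")
printf '%s\n' "$matches"

rm -rf "$dir"
```

Unlike the comm approach, this keeps the matches in the order they appear in input and needs no sorting, at the cost of holding every title in memory.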

Upvotes: 1
