Reputation: 871
I have a file named input that contains a list of strings, each of which is either a Wikipedia title or a substring of one. I only want to print out the lines that are full Wikipedia titles, not the substrings.
I have another file named wikititle that contains a list of all Wikipedia titles. So I want to grep each line from input and, if it matches ^{string}$, print out that line.
I came up with the command below:
cat input | xargs -0 -I{} bash -c 'grep -q -w ^{}$ wikititle && { echo {}; }'
But it gives me an error of:
xargs: command too long
How do I make this happen? Thanks!
Upvotes: 0
Views: 359
Reputation: 295650
The right way to print out lines which are found in both of two files is with comm:
comm -12 <(sort input) <(sort wikititle)
This is vastly more efficient than what you were trying to do: it runs only a single pass, and needs to store very little content in memory at a time (sort can have larger memory requirements, but the GNU implementation supports using disk-backed temporary storage).
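To see what comm -12 does, here is a minimal sketch on made-up sample data (the titles and the temporary directory are assumptions for illustration, not from the question). It uses pre-sorted temporary files rather than process substitution, so it also runs in a plain POSIX sh, and it pins LC_ALL=C so that sort and comm agree on ordering:

```shell
# Hypothetical sample data; only the file names come from the question.
tmp=$(mktemp -d)
printf '%s\n' 'Foo' 'Foo [Bar]' 'Ba' > "$tmp/input"
printf '%s\n' 'Bar' 'Foo [Bar]' 'Foo' 'Baz' > "$tmp/wikititle"

# comm expects both inputs sorted in the same collation order.
LC_ALL=C sort "$tmp/input" > "$tmp/input.sorted"
LC_ALL=C sort "$tmp/wikititle" > "$tmp/wikititle.sorted"

# -1 drops lines only in the first file, -2 drops lines only in the second,
# so only lines present in both files remain.
common=$(LC_ALL=C comm -12 "$tmp/input.sorted" "$tmp/wikititle.sorted")
printf '%s\n' "$common"
# prints:
# Foo
# Foo [Bar]

rm -rf "$tmp"
```

Note that "Ba" is dropped: it is a substring of a title, but not a whole line in wikititle.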
Another much more efficient approach would be the following:
grep -F -x -f input wikititle
...this would run grep only once, using all the (newline-separated) strings given in input, against the contents of wikititle.
Using grep -F avoids treating arguments as regexes, so that even strings like Foo [Bar] will match themselves when fully anchored (which they wouldn't with a grep which treated [Bar] as a character class). Using -x requires full-line matches (thank you, @tripleee!).
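The effect of -F can be checked with a small experiment. This is a sketch on hypothetical sample data (the titles are made up): the same pattern file is applied once as fixed strings and once as regexes.

```shell
# Hypothetical sample data; only the file names come from the question.
tmp=$(mktemp -d)
printf '%s\n' 'Foo [Bar]' 'Foo' > "$tmp/input"
printf '%s\n' 'Foo [Bar]' 'Foobar' > "$tmp/wikititle"

# Fixed-string, whole-line matching: "Foo [Bar]" matches itself.
fixed=$(grep -F -x -f "$tmp/input" "$tmp/wikititle")

# Regex matching: [Bar] is a character class, so the pattern "Foo [Bar]"
# only matches "Foo B", "Foo a" or "Foo r", and nothing in wikititle fits.
# grep exits non-zero when nothing matches, hence the "|| true".
regex=$(grep -x -f "$tmp/input" "$tmp/wikititle" || true)

printf 'fixed: %s\n' "$fixed"   # prints "fixed: Foo [Bar]"
printf 'regex: %s\n' "$regex"   # prints "regex: " (no match)

rm -rf "$tmp"
```

In both runs the bare "Foo" line finds nothing, because -x refuses to match it against part of "Foobar".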
...and, if you really wanted to use xargs and a whole bunch of separate grep calls and a shell-level echo for no good reason...
<input xargs bash -c \
'for line; do grep -q -F -x -e "$line" wikititle && printf "%s\n" "$line"; done' _
Note that this doesn't use -I '{}', which is an option which makes xargs far less efficient (forcing it to run a command once for every single input line), and also introduces potential security bugs when used with bash -c (if a line in your input file contains $(rm -rf ~), you probably don't want to execute it). Instead, it uses a for loop in your bash to iterate over the lines passed as arguments.
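The security difference can be demonstrated without xargs at all. The sketch below uses a harmless stand-in payload (an assumption for illustration) and compares splicing a line into the script text, which is effectively what -I '{}' does with bash -c, against passing it as a positional argument:

```shell
# A harmless stand-in for a malicious line in the input file.
line='$(echo INJECTED)'

# Unsafe: the data becomes part of the code, so bash runs the
# command substitution hidden inside it.
unsafe=$(bash -c "echo $line")

# Safe: the script text is constant and the data arrives as "$1",
# so it is only ever treated as a string.
safe=$(bash -c 'printf "%s\n" "$1"' _ "$line")

printf 'unsafe: %s\n' "$unsafe"   # prints "unsafe: INJECTED"
printf 'safe:   %s\n' "$safe"     # prints "safe:   $(echo INJECTED)"
```

The trailing _ fills $0, so the first real argument lands in $1; the same reason it appears at the end of the xargs command.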
Upvotes: 3
Reputation: 204025
Without sample input and expected output it's a guess but it sounds like all you need is:
awk 'NR==FNR{titles[$0];next} $0 in titles' wikititle input
Remember that shell is an environment from which to manipulate files and processes and invoke tools, NOT a tool to manipulate text. The guys who created shell also created awk for shell to call to manipulate text.
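For readers unfamiliar with the NR==FNR idiom in the awk command, here is the same script spelled out with comments and run on hypothetical sample data (the titles are made up for illustration):

```shell
# Hypothetical sample data; only the file names come from the question.
tmp=$(mktemp -d)
printf '%s\n' 'Foo' 'Ba' 'Foo [Bar]' > "$tmp/input"
printf '%s\n' 'Bar' 'Foo [Bar]' 'Foo' > "$tmp/wikititle"

matches=$(awk '
    # NR==FNR is only true while reading the first file (wikititle):
    # remember each title as an array key, then skip to the next line.
    NR == FNR { titles[$0]; next }
    # For the second file (input): print lines that are known titles.
    $0 in titles
' "$tmp/wikititle" "$tmp/input")

printf '%s\n' "$matches"
# prints:
# Foo
# Foo [Bar]

rm -rf "$tmp"
```

Unlike the comm approach, this keeps the order of input and needs no sorting, at the cost of holding every title in memory.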
Upvotes: 1