Reputation: 331
I have an input.txt file with lines representing some commands, each with two input arguments:
commands a b
commands a c
commands b c
...
And I want to remove all lines for which there is a match (an output file) in folder out. For instance, imagine that only the files out/a_b_out and out/b_c_out exist. Then I would like to remove the first and third line from input.txt.
Moreover, there could be millions of files in out, so I need an efficient way to look for matches. On the other hand, the number of lines in input.txt is on the order of a few thousand, much more manageable.
I have tried to first extract the patterns from the input file (e.g. cut -d " " -f 2-3 input.txt | sed -e 's/ /_/g') and then loop over these entries using grep etc.
I was wondering if there is a faster and more elegant way to perform this. Thanks!
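For reference, what I have now looks roughly like this (simplified; filtered.txt and tmp.txt are just illustrative names). It rescans the whole out folder once per pattern, which is what makes it so slow:
cp input.txt filtered.txt
for patt in $(cut -d " " -f 2-3 input.txt | sed -e 's/ /_/g'); do
  # each iteration lists the entire out folder again
  if ls out | grep -qx "${patt}_out"; then
    grep -v " ${patt%%_*} ${patt#*_}$" filtered.txt > tmp.txt && mv tmp.txt filtered.txt
  fi
done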
Upvotes: 0
Views: 134
Reputation: 6345
See this small test with awk (if awk is in the game) that does the opposite (just for testing):
$ cat file3
commands a b
commands a c
commands b c
$ ls -l *_out
-rw-r--r-- 1 root root 0 Mar 15 04:02 a_b_out
-rw-r--r-- 1 root root 0 Mar 15 04:05 b_c_out
$ awk 'NR==FNR{a[$2 "_" $3 "_out"]=$0;next}($0 in a){print a[$0]}' file3 <(find . -maxdepth 1 -type f -printf %f\\n)
commands b c
commands a b
Meaning that this inverted logic should give you the results you need: delete the matching entries from the index while scanning the filenames, then print whatever is left at the end (note that the original line order is not preserved):
$ awk 'NR==FNR{a[$2 "_" $3 "_out"]=$0;next}($0 in a){delete a[$0]}END{for (k in a) print a[k]}' inputfile <(find . -maxdepth 1 -type f -printf %f\\n) >newfile
You can remove the -maxdepth 1 to go inside all subdirectories.
This solution builds an index based on the small input file and not on the millions of files that may exist in out; thus performance is expected to be good enough.
Sending the non-matching results to a new file will be much faster than continuously overwriting the existing file.
You can just move newfile over the old file when you are done (mv newfile inputfile).
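If you need to keep the original line order, a variant of the same idea (a sketch along those lines; it still indexes only the small input file) can remember line numbers and print the survivors in order:
awk '
  NR == FNR { line[++n] = $0; idx[$2 "_" $3 "_out"] = n; next }  # pass 1: index input lines by expected filename
  ($0 in idx) { matched[idx[$0]] }                               # pass 2: mark lines whose output file exists
  END { for (i = 1; i <= n; i++) if (!(i in matched)) print line[i] }
' inputfile <(find . -maxdepth 1 -type f -printf %f\\n) > newfile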
Upvotes: 0
Reputation: 67507
this might work for your case
while read -r c x y;
do [ -f "out/${x}_${y}_out" ] || echo "$c" "$x" "$y"
done < input
will iterate over the shorter input file and filter the lines based on existing files; the output will be the commands for which the files are not found. If your input file is not well formed, you may need to strengthen the read command, for example as sketched below.
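A hardened sketch along those lines (input.filtered is just an illustrative name) that skips lines with fewer than three fields and tolerates extra ones:
while read -r c x y extra; do
  [ -n "$y" ] || continue                          # skip lines with fewer than three fields
  [ -f "out/${x}_${y}_out" ] || printf '%s %s %s\n' "$c" "$x" "$y"
done < input > input.filtered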
Upvotes: 3
Reputation: 438083
Unless you need awk for additional processing, or you need to preserve input lines exactly as-is in terms of whitespace, consider karakfa's helpful shell-only solution.
An awk solution:
Given that there can be millions of files in out/, building an index of filenames is not an option, but you can defer to the shell to test file existence.
This will be slow, because a sh child process is created for each input line, but it may be acceptable with input on the order of a few thousand lines:
awk '{ fpath = "out/" $2 "_" $3 "_out"; if (1 == system("[ -f \"" fpath "\" ]")) print }' \
  input.txt > input.tmp.$$.txt && mv input.tmp.$$.txt input.txt
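If the per-line sh processes turn out to be too slow, one possible workaround (a sketch; caveat: an existing but unreadable file would be misclassified as missing) is to test existence via the return code of awk's getline instead of spawning a shell:
awk '{
  fpath = "out/" $2 "_" $3 "_out"
  rc = (getline dummy < fpath)   # -1 if the file cannot be opened, i.e. (most likely) does not exist
  close(fpath)                   # avoid running out of file descriptors
  if (rc < 0) print              # keep only lines whose output file is missing
}' input.txt > input.tmp.$$.txt && mv input.tmp.$$.txt input.txt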
Upvotes: 0