retrot

Reputation: 331

Remove lines from file that contain strings that match filenames in folder

I have an input.txt file whose lines represent commands, each with two input arguments:

commands a b 
commands a c
commands b c 
...

And I want to remove all lines for which a matching output file exists in folder out. For instance, imagine that only the files out/a_b_out and out/b_c_out exist. Then I would like to remove the first and third lines from input.txt.

Moreover, there could be millions of files in out, so I need an efficient way to look for matches. On the other hand, the number of lines in input.txt is on the order of a few thousand, which is much more manageable.

I have tried first extracting the patterns from the input file (e.g. cut -d " " -f 2-3 input.txt | sed -e 's/ /_/g'), and then looping over these entries and using grep, etc.
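Concretely, something along these lines (a sketch of the slow approach):

# one scan of the huge out/ directory per input line -- too slow
while read -r pat; do
    ls out | grep -qx "${pat}_out" && echo "remove: $pat"
done < <(cut -d ' ' -f 2-3 input.txt | sed 's/ /_/g')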

I was wondering if there is a faster and more elegant way to perform this. Thanks!

Upvotes: 0

Views: 134

Answers (3)

George Vasiliou

Reputation: 6345

See this small test with awk (if awk is an option) that does the opposite (prints the matching lines, just for testing):

$ cat file3
commands a b 
commands a c
commands b c

$ ls -l *_out
-rw-r--r-- 1 root root 0 Mar 15 04:02 a_b_out
-rw-r--r-- 1 root root 0 Mar 15 04:05 b_c_out

$ awk 'NR==FNR{a[$2 "_" $3 "_out"]=$0;next}($0 in a){print a[$0]}' file3 <(find . -maxdepth 1 -type f -printf %f\\n)
commands b c
commands a b 

Meaning that this inverted command should give you the results you need. NR==FNR is true only while awk reads the first file, so the first block indexes each input line under its expected output filename; every filename arriving from find then deletes its match, and the END block prints the lines that remain:

$ awk 'NR==FNR{a[$2 "_" $3 "_out"]=$0;next}($0 in a){delete a[$0]}END{for (k in a) print a[k]}' inputfile <(find out -maxdepth 1 -type f -printf %f\\n) >newfile

You can remove -maxdepth 1 to descend into all subdirectories.

This solution builds an index from the small input file, not from the millions of files that may exist in out; thus performance is expected to be good enough.

Sending the non-matching results to a new file will be much faster than continuously overwriting the existing file.

You can just move newfile over the original when you are done (mv newfile inputfile).
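One caveat: awk's for (k in a) loop visits keys in unspecified order, so the output order may differ from the input file. If order matters, here is a sketch along the same lines that indexes by line number instead:

$ awk 'NR==FNR{line[NR]=$0; idx[$2 "_" $3 "_out"]=NR; n=NR; next}
       ($0 in idx){delete line[idx[$0]]}
       END{for (i=1; i<=n; i++) if (i in line) print line[i]}' \
       inputfile <(find out -maxdepth 1 -type f -printf %f\\n) >newfile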

Upvotes: 0

karakfa

Reputation: 67507

This might work for your case:

while read -r c x y; do
  [ -f "out/${x}_${y}_out" ] || echo "$c" "$x" "$y"
done < input

This iterates over the shorter input file and filters its lines based on which output files exist; the output is the set of commands for which no file was found. If your input file is not well formed, you may need to strengthen the read command.
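For example, a slightly hardened variant (just a sketch, assuming space-separated fields; the trailing _ variable absorbs any extra columns, and printf is safer than echo for arbitrary data):

while IFS=' ' read -r c x y _; do
  [ -f "out/${x}_${y}_out" ] || printf '%s %s %s\n' "$c" "$x" "$y"
done < input > newfile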

Upvotes: 3

mklement0

Reputation: 438083

Unless you need awk for additional processing or you need to preserve input lines exactly as-is in terms of whitespace, consider karakfa's helpful shell-only solution.

An awk solution:

Given that there can be millions of files in out/, building an index of filenames is not an option, but you can defer to the shell to test file existence.

This will be slow, because a sh child process is created for each input line, but it may be acceptable with input on the order of a few thousand lines:

awk '{ fpath = "out/" $2 "_" $3 "_out"; if (1 == system("[ -f \"" fpath "\" ]")) print }' \
  input.txt > input.tmp.$$.txt && mv input.tmp.$$.txt input.txt
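If the per-line shell startup proves too slow, GNU awk's bundled filefuncs extension provides a stat() function that avoids child processes entirely. A sketch, assuming gawk with dynamic extensions enabled (stat() returns a negative value when the file does not exist):

gawk '@load "filefuncs"
      { if (stat("out/" $2 "_" $3 "_out", st) < 0) print }' \
  input.txt > input.tmp.$$.txt && mv input.tmp.$$.txt input.txt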

Upvotes: 0
