Descartes

Reputation: 69

Remove text files with less than three lines

I used an Awk script to split a big text document into independent files, and now I'm working with about 14k text files. The problem is that many of the files contain only three lines of text, and it isn't useful for me to keep them.

I know I can filter lines with awk 'NF>=3' file, but I don't want to delete lines inside files; rather, I want to delete the files whose content is just two or three lines of text.

Thanks in advance.

Upvotes: 2

Views: 918

Answers (4)

stack0114106

Reputation: 8711

You can try Perl. The solution below is efficient because the file handle ARGV is closed as soon as the line count exceeds 3, so no file is read past its fourth line. It prints the names of the files with more than 3 lines, i.e. the ones to keep.

perl -nle 'close(ARGV) if $. > 3; $kv{$ARGV}++; END { for (sort keys %kv) { print if $kv{$_} > 3 } }' *

If you want to feed file names from some other command (say find), you can use it like this:

$ find . -name "*" -type f -exec perl -nle ' close(ARGV) if ($.>3) ; $kv{$ARGV}++; END { for(sort keys %kv) { print if $kv{$_}>3 } } ' {} \;
./bing.fasta
./chris_smith.txt
./dawn.txt
./drcatfish.txt
./foo.yaml
./ip.txt
./join_tab.pl
./manoj1.txt
./manoj2.txt
./moose.txt
./query_ip.txt
./scottc.txt
./seats.ksh
./tane.txt
./test_input_so.txt
./ya801.txt

$

For comparison, here is the output of wc -l * on the same directory:

$ wc -l *
  12 bing.fasta
  16 chris_smith.txt
   8 dawn.txt
   9 drcatfish.txt
   3 fileA
   3 fileB
  13 foo.yaml
   3 hubbs.txt
   8 ip.txt
  19 join_tab.pl
   6 manoj1.txt
   6 manoj2.txt
   5 moose.txt
  17 query_ip.txt
   3 rororo.txt
   5 scottc.txt
  22 seats.ksh
   1 steveman.txt
   4 tane.txt
  13 test_input_so.txt
  24 ya801.txt
 200 total

$

Upvotes: 1

agc

Reputation: 8406

If the files in the current directory are all text files, this should be efficient and portable:

for f in *; do 
    [ $(head -4 "$f" | wc -l) -lt 4 ] && echo "$f"
done  # | xargs rm

Inspect the list, and if it looks OK, then remove the # on the last line to actually delete the unwanted files.

Why use head -4? Because wc doesn't know when to quit. Suppose half of the text files were each more than a terabyte long; if that were the case wc -l alone would be quite slow.
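To see the head -4 trick in action, here is a small self-contained demo (scratch files, names are mine for illustration): wc -l can never report more than 4, so each file is classified after reading at most four lines.

```shell
# Scratch directory with two demo files.
cd "$(mktemp -d)"
printf 'a\nb\nc\n'          > three.txt   # 3 lines -> listed
printf 'a\nb\nc\nd\ne\nf\n' > six.txt     # 6 lines -> kept

for f in *; do
    # head -4 stops after four lines, so wc -l is capped at 4
    if [ $(head -4 "$f" | wc -l) -lt 4 ]; then
        echo "$f"
    fi
done
```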

Upvotes: 3

RavinderSingh13

Reputation: 133630

Could you please try the following find command (tested with GNU awk):

find /your/path/ -type f -exec awk -v lines=3 'NR>lines{f=1; exit} END{if (!f) print FILENAME}' {} \;

The above will print to the console the names of the files having 3 lines or fewer. Once you are happy with the results, try the following to delete them. I suggest running it in a test directory first, and proceeding only once you are fully satisfied with its output. (Remove the echo from the command below; I have left it in to be on the safer side.)

find /your/path/ -type f -exec awk -v lines=3 'NR>lines{f=1; exit} END{exit f}' {} \; -exec echo rm -f {} \;
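As a self-contained illustration of the exit-status filter (my sketch; file names are made up): the first -exec's awk exits with f, i.e. status 0 exactly when the file has 3 lines or fewer, and find runs the second -exec only for those files.

```shell
# Scratch directory with two demo files.
cd "$(mktemp -d)"
printf 'a\n'             > one.txt    # 1 line  -> matched
printf '1\n2\n3\n4\n5\n' > five.txt   # 5 lines -> skipped

# awk acts as a find test: exit 0 (success) only for files of <= 3 lines,
# so the echo rm -f runs only for those files.
find . -type f -exec awk -v lines=3 'NR>lines{f=1; exit} END{exit f}' {} \; \
     -exec echo rm -f {} \;
```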

Upvotes: 3

Neal.Marlin

Reputation: 518

You may use wc to count the lines and then decide whether or not to delete the file. You should write a shell script rather than just an awk command.
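One way that script could be sketched (my assumptions: file names contain no newlines, and files end with a trailing newline, since wc -l counts newline characters):

```shell
# Scratch directory with two demo files (names are illustrative).
cd "$(mktemp -d)"
printf 'x\ny\n'       > tiny.txt   # 2 lines
printf '1\n2\n3\n4\n' > big.txt    # 4 lines

for f in *; do
    # wc -l < file prints only the count, with no file name attached
    if [ $(wc -l < "$f") -le 3 ]; then
        echo "would delete: $f"    # swap the echo for: rm -f -- "$f"
    fi
done
```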

Upvotes: 1
