Tryer

Reputation: 4050

Remove empty lines and trim/squeeze blanks from files recursively

I have a folder structure thus:

----project\
       ----datafolder1\
               ----file1.txt
               ----file2.txt
       ----datafolder2\
               ----file1.txt
               ----file2.txt
       ----file1.txt
       ----file2.txt

Each of the text files has lines that contain purely numerical data (integers and decimals) as well as other unnecessary content. This includes:

  1. blank spaces to start a line, between two numerical values of interest, before end of line, e.g.:

    <blank><blank>43<blank><tab>73.5<blank><end of line>
    

    I'd like the above to just be:

    43<blank>73.5<end of line>
    
  2. empty lines.

    I'd like these empty lines to be removed so that all interesting data is on adjacent and contiguous lines.

  3. lines with letters, e.g.:

    ---next line contains 50 customer data----
    

I want these to be removed as well.

Instead of making these modifications manually, I'd like to automate them with a script that runs from the project\ folder, recursively visits datafolder1 and then datafolder2, operates on the text files, and creates modified text files with the above properties, named modfile1.txt, modfile2.txt and so on.

Recursively visiting subfolders seems possible using the answer specified here. Using grep to find only lines that contain numbers seems possible according to the answer here. However, that only works in the case where each line of interest contains a single number. In my case, a line of interest can contain multiple integers (positive or negative) and decimals separated by spaces. Finally, putting all of this together into a script is beyond my reach given my current knowledge of these tools. I am okay if all of this is done in awk or a .sh script itself.

Upvotes: 1

Views: 446

Answers (4)

RARE Kpop Manifesto

Reputation: 2821

To throw away empty lines (where a row filled entirely with spaces and tabs, even Unicode variants, also counts as empty), just do:

mawk1.3.4 'BEGIN { FS = "^$" } /[[:graph:]]/'

Setting FS = "^$" prevents it from wasting CPU splitting fields you don't need.
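For example, feeding a mix of whitespace-only and non-empty lines through that filter (shown here with plain awk; the answer recommends mawk 1.3.4) keeps only the lines that carry at least one printable, non-space character:

```shell
# FS = "^$" never matches inside a record, so every line stays one field;
# /[[:graph:]]/ keeps lines containing a printable, non-whitespace character.
printf ' \t\nfoo\n\nbar baz\n' |
awk 'BEGIN { FS = "^$" } /[[:graph:]]/'
# foo
# bar baz
```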

Word of caution: stick with mawk 1.3.

PS:

The reason for avoiding gnu-awk here is that, despite gawk and mawk2 matching each other on /[[:graph:]]/, my internal testing found that both drop a number of Korean Hangul characters and some emoji in the 4-byte Unicode space.

Only mawk 1.3.4 seems to account for them correctly.

PS2:

FS = "^$" is also faster than FS = RS.

Upvotes: 0

oguz ismail

Reputation: 50775

You can use awk to remove blank lines and lines that contain letters, trim leading and trailing spaces, and squeeze spaces between words as well.

# selects *.txt minus mod*.txt
find . -name '*.txt' ! -name 'mod*' -exec awk '
FNR == 1 {
  close(fn)
  fn = FILENAME
  sub(/.*\//, "&mod", fn)
}
/[[:alpha:]]/ { next }
NF { $1 = $1; print > fn }' {} +
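The FNR == 1 block derives the output name by prefixing mod to the basename: in sub(/.*\//, "&mod", fn), the & re-inserts the matched directory part. A quick way to see what it produces (the path below is illustrative):

```shell
awk 'BEGIN {
  fn = "./datafolder1/file1.txt"
  sub(/.*\//, "&mod", fn)   # "&" re-inserts the matched "./datafolder1/"
  print fn
}'
# ./datafolder1/modfile1.txt
```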

For how $1 = $1 works, see Ed's answer here.
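In short, assigning $1 = $1 forces awk to rebuild $0 from its fields joined by OFS (a single space by default), which trims and squeezes whitespace in one step:

```shell
# NF is non-zero only for lines with at least one field, so blank lines drop out;
# $1 = $1 rebuilds the record with single-space separators.
printf '  43 \t 73.5  \n' | awk 'NF { $1 = $1; print }'
# 43 73.5
```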

Upvotes: 4

mivk

Reputation: 14854

While it can be done with awk as shown in the accepted answer, it can also be done with Perl:

find . -name '*.txt' -exec perl -i.bak -nle '
    next unless ( /^[\s\d\.\-]+$/ && /\d/ );  # skip unwanted lines
    s/\s+/ /g;                                # keep only single spaces
    s/^\s+|\s+$//g;                           # trim whitespace at start and end
    print' {} +

This uses -i.bak to do in-place replacement, saving your original files with a .bak extension.

The -l option chomps the input newline and adds one back on print; we need that because the substitutions trim all whitespace from the end of each line (which also removes \r (CR) characters in case the files came from Windows).

If it's important to keep the original file names, you could do something like this afterwards:

find . -name "*.txt.bak" -print0 \
| while IFS= read -r -d '' f; do
    mv "${f%%.bak}" "${f%%.txt.bak}-new.txt";
    mv "$f" "${f%%.bak}"
  done
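The renames rely on shell suffix stripping: ${f%%.bak} drops the .bak suffix and ${f%%.txt.bak} drops .txt.bak (with a single fixed suffix like these, %% and % behave the same). For a hypothetical path:

```shell
f=./datafolder1/file1.txt.bak
echo "${f%%.txt.bak}-new.txt"   # new name for the processed file
echo "${f%%.bak}"               # original name, restored from the backup
# ./datafolder1/file1-new.txt
# ./datafolder1/file1.txt
```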

Upvotes: 1

btb91

Reputation: 66

Here is a similar version using sed.

find . -name \*.txt -print -exec sh -c "sed -r '/(^\s*$|[[:alpha:]])/d ; s/\s+/ /g ; s/(^\s|\s$)//g' '{}' > '{}.mod'" \;

There is a small issue with naming the new files modfile.txt: the next time you run it, it would process modfile.txt and create modmodfile.txt. Using a .mod suffix instead prevents the modified files from being processed again.

/(^\s*$|[[:alpha:]])/d  # delete blank lines or lines with alpha
s/\s+/ /g               # replace multiple spaces with one space
s/(^\s|\s$)//g          # replace space at the beginning or end of the line with nothing
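Applied to a few sample lines (GNU sed, since -r and \s are GNU extensions), the three expressions give:

```shell
# Delete blank/alpha lines, squeeze runs of whitespace, trim the ends.
printf '  43\t73.5  \n---header line---\n\n-1 2.5\n' |
sed -r '/(^\s*$|[[:alpha:]])/d ; s/\s+/ /g ; s/(^\s|\s$)//g'
# 43 73.5
# -1 2.5
```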

Upvotes: 1
