Reputation: 4050
I have a folder structure thus:
----project\
    ----datafolder1\
        ----file1.txt
        ----file2.txt
    ----datafolder2\
        ----file1.txt
        ----file2.txt
    ----file1.txt
    ----file2.txt
Each of the text files has lines that contain purely numerical data (integers and decimals) as well as other information that is unnecessary. The unnecessary parts include:
blank spaces to start a line, between two numerical values of interest, before end of line, e.g.:
<blank><blank>43<blank><tab>73.5<blank><end of line>
I'd like the above to just be:
43<blank>73.5<end of line>
empty lines.
I'd like these empty lines to be removed so that all interesting data is on adjacent and contiguous lines.
lines with letters, e.g.:
---next line contains 50 customer data----
I want these to be removed as well.
Instead of doing these modifications manually, I'd like to automate this with a script that runs from the project\ folder, recursively visits datafolder1 and then datafolder2, operates on the text files, and creates modified text files with the above properties, named modfile1.txt, modfile2.txt and so on.
Recursively visiting subfolders seems possible using the answer specified here. Using grep to find only lines that contain numbers seems possible according to the answer here. However, that only works when each line of interest contains a single number. In my case, a line of interest can contain multiple integers (positive or negative) and decimals separated by spaces. Finally, putting all of this together into a script is beyond my reach given my current knowledge of these tools. I am okay if all of this can be done in awk or .sh itself.
Upvotes: 1
Views: 446
Reputation: 2821
The regex I use to throw away empty lines (a whole row filled with spaces and tabs, even Unicode variants, counts as an empty line here) is:
mawk1.3.4 'BEGIN { FS = "^$" } /[[:graph:]]/'
Setting FS = "^$" prevents awk from wasting CPU splitting fields you don't need.
Word of caution: stick with mawk 1.3.4 rather than gawk or mawk2.
PS:
The reason for ruling out gnu-awk here is that, although gawk and mawk2 match each other on /[[:graph:]]/, my own testing showed that both drop a number of Korean Hangul characters and some emoji in the 4-byte Unicode space.
Only mawk1.3.4 seems to account for them correctly.
PS2:
FS = "^$" is faster than FS = RS.
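To see what this filter does and does not do, here is a minimal sketch on made-up sample input. It assumes a generic awk on the PATH that understands POSIX character classes (not necessarily mawk 1.3.4) and ASCII-only data, so it says nothing about the Unicode behaviour mentioned in the PS. The variable name kept is purely illustrative.

```shell
# Assumption: "awk" is any awk with POSIX character classes; the sample data is invented.
# Four input lines: data with stray whitespace, an empty line,
# a whitespace-only line, and a second data line.
kept=$(printf '  43 \t73.5 \n\n \t \n-1  2.5\n' |
  awk 'BEGIN { FS = "^$" } /[[:graph:]]/')

# Only the lines holding a visible character survive; their leading,
# trailing and repeated whitespace is left untouched.
printf '%s\n' "$kept"
```

On its own the filter covers the empty-line requirement only, so trimming and the removal of lines with letters would still need a separate step.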
Upvotes: 0
Reputation: 50775
You can use awk to remove blank lines and lines that contain letters, trim leading and trailing spaces, and squeeze spaces between words as well.
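# selects *.txt minus mod*.txt
find . -name '*.txt' ! -name 'mod*' -exec awk '
  FNR == 1 {                    # first line of each input file
    close(fn)                   # close the previous output file
    fn = FILENAME
    sub(/.*\//, "&mod", fn)     # dir/file1.txt -> dir/modfile1.txt
  }
  /[[:alpha:]]/ { next }        # skip lines that contain letters
  NF { $1 = $1; print > fn }    # skip blank lines, squeeze whitespace
' {} +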
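For how $1 = $1 works, see Ed's answer here. In short, assigning to a field makes awk rebuild the record from its fields joined by single spaces, which drops leading and trailing whitespace and squeezes the gaps in between.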
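The effect of $1 = $1 can be checked in isolation. This is a small sketch on an invented sample line, assuming awk's default field separator and default output field separator; squeezed is just an illustrative variable name.

```shell
# Assumption: invented sample line with leading blanks, a blank plus a tab
# between the numbers, and a trailing blank.
squeezed=$(printf '  43 \t73.5 \n' | awk '{ $1 = $1; print }')

# Brackets make any leftover whitespace visible.
printf '[%s]\n' "$squeezed"    # prints [43 73.5]
```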
Upvotes: 4
Reputation: 14854
While it can be done with awk as shown in the accepted answer, it can also be done with Perl:
find . -name '*.txt' -exec perl -i.bak -nle '
    next unless ( /^[\s\d\.\-]+$/ && /\d/ );  # skip unwanted lines
    s/\s+/ /g;                                # keep only single spaces
    s/^\s+|\s+$//g;                           # trim whitespace at start and end
    print' {} +
This uses -i.bak to do in-place replacement, saving your original files with a .bak extension.
The -l option adds a newline on output, which is needed because we trimmed all whitespace characters from the end of each line (this also removes \r (CR) characters in case the files came from Windows).
If it's important to keep the original file names, you could do something like this afterwards:
find . -name "*.txt.bak" -print0 \
| while IFS= read -r -d '' f; do
    mv "${f%%.bak}" "${f%%.txt.bak}-new.txt"   # cleaned file.txt -> file-new.txt
    mv "$f" "${f%%.bak}"                       # restore file.txt.bak -> file.txt
done
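The two parameter expansions in that loop do the renaming work. Here is a small sketch of what they produce for one hypothetical path; the path and the variable names cleaned and renamed are illustrative only.

```shell
f=./datafolder1/file1.txt.bak      # hypothetical path as find would print it
cleaned=${f%%.bak}                 # strip the .bak suffix
renamed=${f%%.txt.bak}-new.txt     # strip .txt.bak, then append -new.txt

printf '%s\n%s\n' "$cleaned" "$renamed"
# prints ./datafolder1/file1.txt
# then   ./datafolder1/file1-new.txt
```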
Upvotes: 1
Reputation: 66
Here is a similar version using sed.
find -name \*.txt -print -exec sh -c "sed -r '/(^\s*$|[[:alpha:]])/d ; s/\s+/ /g ; s/(^\s|\s$)//g' '{}' > '{}.mod'" \;
There is a small issue with naming the new files as modfile.txt. The next time you run it, it will process modfile.txt and create modmodfile.txt. Adding a .mod suffix will prevent the modified files from being processed.
/(^\s*$|[[:alpha:]])/d    # delete blank lines or lines with alpha
s/\s+/ /g                 # replace multiple spaces with one space
s/(^\s|\s$)//g            # replace space at the beginning or end of the line with nothing
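One possible variation, not part of the answer above, passes the file names to the inner shell as arguments instead of embedding {} in the sh -c string, which tends to be more robust for names containing quotes or spaces. It is a sketch under these assumptions: sed accepts -E (GNU and BSD sed do), and the scratch directory, file contents and the variable names dir and script are invented for illustration.

```shell
# Assumptions: sed supports -E; "dir" and its contents are invented test data.
dir=$(mktemp -d)
mkdir "$dir/datafolder1"
printf '  43 \t73.5 \n\n---next line contains 50 customer data----\n-1  2.5\n' \
  > "$dir/datafolder1/file1.txt"

# The same three steps, written with POSIX character classes.
script='/^[[:space:]]*$|[[:alpha:]]/d; s/[[:space:]]+/ /g; s/^ | $//g'

# The sed script and the file names reach the inner shell as arguments.
find "$dir" -name '*.txt' -exec sh -c '
  script=$1; shift
  for f do
    sed -E "$script" "$f" > "$f.mod"
  done
' sh "$script" {} +

cat "$dir/datafolder1/file1.txt.mod"
```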
Upvotes: 1