Reputation: 565
I have the following; it's ignoring lines that contain just #, but not the empty lines (lines containing only a newline).
Do you know of a way I can kill two birds with one stone? I.e. if a line doesn't contain more than one character, delete it.
function check_duplicates {
  awk '
    # Remember each input file name.
    FNR == 1 { files[FILENAME] }
    {
      if ((FILENAME, $0) in a)
        dupsInFile[FILENAME]
      else {
        a[FILENAME, $0]
        dups[$0] = $0 in dups ? (dups[$0] RS FILENAME) : FILENAME
        count[$0]++
      }
    }
    # Drop lines containing a "#" from the duplicate list.
    {
      if ($0 ~ /#/)
        delete dups[$0]
    }
    # Print duplicates that occur in more than one file.
    END {
      for (k in dups)
        if (count[k] > 1) {
          print ("\n\nDuplicate line found: " k) " - In the following file(s)"
          print dups[k]
        }
      printf "\n"
    }
  ' $SITEFILES
  awk '
    { b[$0]++ }
    $0 in b {
      if ($0 ~ /#/)
        delete b[$0]
      if (b[$0] > 1) {
        print ("\n\nRepeated line found: " $0) " - In the following file"
        print FILENAME
        delete b[$0]
      }
    }
  ' $SITEFILES
}
The expected input is usually as follows.
#File Path's
/path/to/file1
/path/to/file2
/path/to/file3
/path/to/file4
#
/more/paths/to/file1
/more/paths/to/file2
/more/paths/to/file3
/more/paths/to/file4
/more/paths/to/file5
/more/paths/to/file5
In this case, /more/paths/to/file5 occurs twice and should be flagged as such.
However, there are also many empty lines, which I'd rather ignore.
Er, it also has to be awk; I'm doing a tonne of post-processing and don't want to vary from awk for this bit, if that's okay :)
It really seems to be a bit tougher than I would have expected.
Cheers, Ben
Upvotes: 0
Views: 1211
Reputation: 26667
You can combine both if conditions into a single test. Note that awk strips the record separator before your script sees the line, so $0 never contains \n; compare against the empty string instead:
if ($0 == "" || $0 == "#") {
delete dups[$0]
}
OR
More compactly, as a single regex:
if ($0 ~ /^#?$/) {
delete dups[$0]
}
What it does:
^ matches the start of the line.
#? matches zero or one #.
$ matches the end of the line.
So ^$ matches empty lines, and ^#$ matches lines containing only #.
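As a quick sanity check, here is a minimal sketch of the ^#?$ filter applied to input shaped like yours (the sample lines piped in are made up for the demo, not taken from your files):

```shell
# Feed a small sample through the filter: skip empty lines and
# lines that are just "#", then report any line seen before.
printf '%s\n' '#File Paths' '/a' '/b' '' '#' '/a' |
awk '
  $0 ~ /^#?$/ { next }                                  # drop empty and "#"-only lines
  seen[$0]++  { print "Repeated line found: " $0 }      # already seen => duplicate
'
# prints: Repeated line found: /a
```

The empty line and the lone # are skipped before they ever reach the duplicate counter, so neither gets flagged, while /a is still reported.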
Upvotes: 2