Find and Remove Lines in a file by comparing multiple strings

Question

I have the following file:

SOME TEXT AT START OF FILE
    STRING1 SMALL
    STRING2 SMALL
    STRING1 MEDIUM
    STRING3 LARGE
    STRING2 XLG
SOME TEXT TO SEPARATE LISTS
    STRING4 SMALL
    STRING1 MEDIUM
    STRING1 SMALL
    STRING5 LARGE
    STRING6 SMALL
SOME MORE TEXT TO SEPARATE LISTS
    ANOTHER LIST
...

For each list, I only want to keep the largest occurrence of each string (SMALL < MEDIUM < LARGE < XLG), so that the result looks like this:

SOME TEXT AT START OF FILE
    STRING1 MEDIUM
    STRING3 LARGE
    STRING2 XLG
SOME TEXT TO SEPARATE LISTS
    STRING4 SMALL
    STRING1 MEDIUM
    STRING5 LARGE
    STRING6 SMALL
SOME MORE TEXT TO SEPARATE LISTS
    ANOTHER LIST
...

I am not sure how to approach this. I need to do it in a bash script, run from Terminal on a Mac.

I also need to modify another, similar list:

TEXT
    STRING1
    STRING2
    STRING3
    STRING1
TEXT
    STRING4
    STRING1
TEXT
    STRING5
    STRING2
    STRING5
ETC...

How do I eliminate the duplicate strings in this case? I was going to try awk '!seen[$0]++' filename, but that would remove a string from every later list instead of treating each list separately.

oguz ismail · Accepted Answer

For your first question:

$ cat tst.awk
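BEGIN {
    # rank the sizes so they can be compared numerically
    sz["SMALL"]  = 0
    sz["MEDIUM"] = 1
    sz["LARGE"]  = 2
    sz["XLG"]    = 3
}

# an unindented line starts a new list: flush the previous list, then print the separator
/^[^ ]/ {
    dump()
    delete data
    print
    next
}

# first occurrence of this string in the list, or a larger size than the one stored
!($1 in data) || sz[data[$1]] < sz[$2] {
    data[$1] = $2
}

# flush the last list
END {
    dump()
}

# print the size kept for each string; "for (k in data)" visits keys in an unspecified order
function dump(k) {
    for (k in data)
        print "    " k " " data[k]
}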
$
$ awk -f tst.awk file
SOME TEXT AT START OF FILE
    STRING1 MEDIUM
    STRING2 XLG
    STRING3 LARGE
SOME TEXT TO SEPARATE LISTS
    STRING4 SMALL
    STRING5 LARGE
    STRING6 SMALL
    STRING1 MEDIUM
SOME MORE TEXT TO SEPARATE LISTS
    ANOTHER LIST
...
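
Note that the strings inside each list come out in a different order than in the question's expected output, because awk does not define the order in which `for (k in data)` visits keys. If the original order matters, one possible variation is sketched below. It is not part of the answer above: the function name `keep_largest` and the arrays `line`, `key`, `best` and `rank` are made up for illustration. It buffers each list and prints only the line that holds the largest size for its string, in input order. It assumes unknown sizes should rank the same as SMALL and that, on a tie, the first occurrence wins.

```shell
# Hypothetical helper: reads the list file on stdin, writes the filtered lists to stdout.
keep_largest() {
    awk '
    BEGIN { sz["SMALL"] = 0; sz["MEDIUM"] = 1; sz["LARGE"] = 2; sz["XLG"] = 3 }

    # Print the buffered lines that won for their string, in input order, then reset.
    function dump_block(   i) {
        for (i = 1; i <= n; i++)
            if (best[key[i]] == i)
                print line[i]
        split("", best)    # portable way to empty an array
        split("", rank)
        n = 0
    }

    # Unindented line: the current list is complete.
    /^[^ ]/ { dump_block(); print; next }

    # Indented line: buffer it and remember which line holds the largest size so far.
    {
        line[++n] = $0
        key[n] = $1
        if (!($1 in best) || sz[$2] > rank[$1]) {
            best[$1] = n
            rank[$1] = sz[$2]
        }
    }

    END { dump_block() }
    '
}
```

It would be called as `keep_largest < file`. Because the original lines are printed unchanged, their indentation is preserved as well.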

And for the second one:

awk '/^[^ ]/{delete seen}!seen[$0]++' file
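
To make the one-liner easier to follow, here is the same logic spread over several lines, as a sketch. The wrapper name `dedupe_per_list` is made up for illustration, and `split("", seen)` is used in place of `delete seen` only as a more conservative way to empty the array. Like the one-liner, it assumes that list items are indented with spaces and that every unindented line starts a new list.

```shell
# Hypothetical helper: reads the file on stdin, writes it back without per-list duplicates.
dedupe_per_list() {
    awk '
    # Unindented line: a new list starts, so forget what the previous list contained.
    /^[^ ]/ { split("", seen) }

    # Print a line only the first time it appears since the last reset.
    !seen[$0]++
    '
}
```

It would be called as `dedupe_per_list < file`. Lines are compared in full, so two items that differ only in indentation or trailing whitespace are not treated as duplicates.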
