Reputation: 477

Deleting duplicated fields within record

I want to delete duplicate instances of the key (the first 2 fields) in each record. For my specific case the duplicates actually appear reversed.

So given

a b b a stuff1 b a stuff2 stuff3 b a

where each space is a tab

I want:

a b stuff1 stuff2 stuff3

and I thought this would do it:

awk 'BEGIN {FS=OFS="\t"} 
     {gsub($2 "\t" $1,"")}
     1' file

Alternative solutions are welcome but I am particularly interested in why that does not work
(I have tried it with a dynamic regexp and with gensub btw).

Per a previous question I am aware that I may/will end up with duplicate tabs and will take care of that outside awk.

EDIT

Solutions so far don't work so here is the real data. For ^ read a tab character

1874    ^Passage de Venus^  <DIRECTORS> ^Passage de Venus^  1874^   Janssen, P.J.C.^    <keywords>^ Passage de Venus^   1874^   astronomy^  astrophotography^   <genres>^   Short

What I want is

1874^   Passage de Venus^   <DIRECTORS>^    Janssen, P.J.C.^    <keywords>^ astronomy^  astrophotography^   <genres>^   Short

Upvotes: 2

Answers (4)

user3442743

You could try this

awk '{gsub($2 "[[:space:]]+" $1, "")}1' file

If this works and using "\t" doesn't, you probably aren't using tabs.

Checked again: there is no bug. It is likely you just have spaces in your file next to the tabs.
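One way to confirm this is to make the whitespace visible. Below is a minimal sketch, assuming a POSIX sed; `show_ws` is a hypothetical helper name, not something from the question. The sed `l` command prints a tab as `\t` and marks the end of each line with `$`, so a stray space next to a tab stands out.

```shell
# show_ws: hypothetical helper that prints each line unambiguously,
# with tabs shown as \t and the end of line marked by $ (sed "l" command).
show_ws() {
    sed -n 'l' "$@"
}

# Sample line with a space before the tab, as in the real data
printf '1874 \tPassage de Venus\n' | show_ws
# prints: 1874 \tPassage de Venus$
```

If the output shows a space before or after `\t`, a pattern built as `$2 "\t" $1` cannot match, because the fields themselves then carry the extra spaces.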

Try

awk 'BEGIN{FS=" *\t *"}{gsub($2 FS $1, "")}1' file

Although this answer was purely meant for troubleshooting why gsub was not working, I have decided to add this addendum to address Ed's concerns in the comments.

This will stop anything other than exactly $2 followed by $1 being matched, and should also sort out the formatting getting messed up:

awk 'BEGIN{FS=" *\t *"}{$0=gensub("("FS")" $2 FS $1 "("FS")","\\1","g")}1' file

Example

 Input
 1234    mal     mal     1234    formal  12345678        blah

 Output
 1234    mal     formal  12345678        blah

This should be more robust again, even with metachars:

awk -F' *\t *' '{x=y;for(i=1;i<=NF;i++)(i>2&&$i==$2&&$(i+1)==$1&&i++)||x=x?x"\t"$i:$i;$0=x}1' file

Upvotes: 2

Ed Morton

Reputation: 204648

This is the kind of solution you really need as it does string comparisons on full fields and so will not falsely match when fields contain RE metacharacters or when fields start or end with the same values as $1/$2:

awk -F' *\t *' -v OFS='\t' '{
    rec = $1 OFS $2
    for (i=3; i<=NF; i++) {
        if ( ($i == $2) && ($(i+1) == $1) ) {
            i++    # skip the pair; the for loop's own i++ then steps past its second field
        }
        else {
            rec = rec OFS $i
        }
    }
    print rec
}
' file

It's untested, as idk if you'll care about the solution being robust or not; test it yourself and massage to suit.
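To illustrate the false-match concern, here is a rough sketch that runs the regex approach and a field-wise approach on the same line. The function names `dedup_regex` and `dedup_fields` are made up for this example, and the sample line assumes pure tab separators with no surrounding spaces.

```shell
# Regex approach from the question: the key is matched as a pattern
# anywhere in the record, including inside other fields.
dedup_regex() {
    awk 'BEGIN { FS = OFS = "\t" } { gsub($2 FS $1, "") } 1' "$@"
}

# Field-wise approach: compare whole fields; appending "" forces a
# string comparison even when the fields look numeric.
dedup_fields() {
    awk -F'\t' -v OFS='\t' '{
        rec = $1 OFS $2
        for (i = 3; i <= NF; i++) {
            if (i < NF && ($i "") == ($2 "") && ($(i+1) "") == ($1 "")) {
                i++    # skip the pair; the loop increment passes its second field
            } else {
                rec = rec OFS $i
            }
        }
        print rec
    }' "$@"
}

line='1234\tmal\tmal\t1234\tformal\t12345678\tblah\n'
printf "$line" | dedup_regex     # "formal" and "12345678" are mangled into "for5678"
printf "$line" | dedup_fields    # only the whole-field pair is dropped
```

The regex version removes `mal<TAB>1234` from inside `formal<TAB>12345678` as well, which is exactly the kind of false match that comparing whole fields avoids.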

Upvotes: 1

fedorqui

Reputation: 290515

Your attempt was good; you probably have some problems with spaces/tabs. Also, you may want to use FS to make it more changeable:

awk 'BEGIN {FS=OFS="\t"} {gsub($2 FS $1, "")}1' file
             |____________________^

So if you notice that the field separator is another one, just change it in your BEGIN block and it will work fine.

Test

$ cat a
a b b a stuff1 b a stuff2 stuff3 b a
$ awk '{gsub($2 FS $1, "")}1' a
a b  stuff1  stuff2 stuff3
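The doubled separators in that output are where the pairs were removed. For tab-delimited data they could be collapsed in a separate step afterwards; the following is only a sketch, and `squeeze_seps` is a hypothetical name. It assumes the data has no genuinely empty fields, since those would be collapsed too.

```shell
# squeeze_seps: hypothetical helper that collapses runs of tabs into a
# single tab and drops a trailing tab left behind by a removed pair.
# Caveat: legitimately empty fields would be removed as well.
squeeze_seps() {
    awk '{ gsub(/\t+/, "\t"); sub(/\t$/, "") } 1' "$@"
}

# Simulated result of the gsub step on tab-delimited input
printf 'a\tb\t\tstuff1\t\tstuff2\tstuff3\t\n' | squeeze_seps
```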

Upvotes: 4

nu11p01n73R

Reputation: 26687

The only problem I can think of is that your input file is not delimited by tabs.

Test

$ echo "a b b a stuff1 b a stuff2 stuff3 b a" | awk  '{gsub($2" "$1,"")}1'
a b  stuff1  stuff2 stuff3

Upvotes: 0
