Reputation: 477
I want to delete duplicate instances of the key (the first 2 fields) in each record. In my specific case the duplicates actually appear reversed.
So given
a b b a stuff1 b a stuff2 stuff3 b a
where each space is a tab
I want:
a b stuff1 stuff2 stuff3
and I thought this would do it:
awk 'BEGIN {FS=OFS="\t"}
{gsub($2 "\t" $1,"")}
1' file
Alternative solutions are welcome, but I am particularly interested in why that does not work (I have tried it with a dynamic regexp and with gensub, btw).
Per a previous question I am aware that I may/will end up with duplicate tabs and will take care of that outside awk.
EDIT
Solutions so far don't work, so here is the real data. For ^ read a tab character.
1874 ^Passage de Venus^ <DIRECTORS> ^Passage de Venus^ 1874^ Janssen, P.J.C.^ <keywords>^ Passage de Venus^ 1874^ astronomy^ astrophotography^ <genres>^ Short
What I want is
1874^ Passage de Venus^ <DIRECTORS>^ Janssen, P.J.C.^ <keywords>^ astronomy^ astrophotography^ <genres>^ Short
Upvotes: 2
Views: 114
Reputation:
You could try this
awk '{gsub($2 "[[:space:]]+" $1, "")}1' file
If this works and using "\t" doesn't, you probably aren't using tabs.
I checked again and there is no bug; it is likely you just have spaces in your file next to the tabs.
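As a minimal sketch of that diagnosis: the sample line and the function name `count_key_matches` below are made up for illustration, modelled on the real data where each tab appears to be followed by a space. It shows the gsub from the question finding nothing to replace, plus one way to make the whitespace visible.

```shell
# Hypothetical sample modelled on the real data: each tab is followed by a space.
sample=$(printf '1874\t Venus\t <D>\t Venus\t 1874\t Short')

# Same gsub as in the question, but print how many replacements were made.
count_key_matches() {
    awk 'BEGIN {FS=OFS="\t"} {print gsub($2 "\t" $1, "")}'
}

# $2 is " Venus" (the leading space is part of the field) and $1 is "1874",
# so the pattern is " Venus<TAB>1874", while the record has " Venus<TAB> 1874".
printf '%s\n' "$sample" | count_key_matches    # prints 0

# sed's l command prints tabs as \t and marks the end of the line with $.
printf '%s\n' "$sample" | sed -n l
```

If the count is 0 on your real file but 1 on a hand-typed line with clean tabs, stray spaces around the tabs are the likely cause.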
Try
awk 'BEGIN{FS=" *\t *"}{gsub($2 FS $1, "")}1' file
Although this answer was purely meant for troubleshooting why gsub was not working, I have decided to add this addendum to address Ed's concerns in the comments.
This will stop anything other than exactly $2 followed by $1 from being matched, and should also sort out the formatting being messed up:
awk 'BEGIN{FS=" *\t *"}{$0=gensub("("FS")" $2 FS $1 "("FS")","\\1","g")}1' file
Input
1234 mal mal 1234 formal 12345678 blah
Output
1234 mal formal 12345678 blah
This should be more robust again, even with metacharacters:
awk -F' *\t *' '{x=y;for(i=1;i<=NF;i++)(i>2&&$i==$2&&$(i+1)==$1&&i++)||x=x?x"\t"$i:$i;$0=x}1' file
Upvotes: 2
Reputation: 204648
This is the kind of solution you really need as it does string comparisons on full fields and so will not falsely match when fields contain RE metacharacters or when fields start or end with the same values as $1/$2:
awk -F' *\t *' -v OFS='\t' '{
    rec = $1 OFS $2
    for (i=3; i<=NF; i++) {
        if ( ($i == $2) && ($(i+1) == $1) ) {
            i++    # skip the reversed pair; the loop increment steps past its second field
        }
        else {
            rec = rec OFS $i
        }
    }
    print rec
}
' file
It's untested, as I don't know whether you'll care about the solution being robust or not; test it yourself and massage to suit.
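To illustrate the false-match concern, here is a small comparison on a made-up line. The function names `regex_dedupe` and `field_dedupe` are hypothetical, and the sketch assumes clean tab separators with no surrounding spaces.

```shell
# Remove the reversed key with a dynamic regexp, as in the question.
regex_dedupe() {
    awk 'BEGIN {FS=OFS="\t"} {gsub($2 FS $1, "")} 1'
}

# Remove the reversed key by comparing whole fields as strings.
field_dedupe() {
    awk -F'\t' -v OFS='\t' '{
        rec = $1 OFS $2
        for (i = 3; i <= NF; i++) {
            # Appending "" forces a string comparison even for numeric-looking fields.
            if (($i "") == ($2 "") && ($(i+1) "") == ($1 ""))
                i++                 # skip the second field of the reversed pair
            else
                rec = rec OFS $i
        }
        print rec
    }'
}

line=$(printf '1234\tmal\tmal\t1234\tformal\t12345678\tblah')
printf '%s\n' "$line" | regex_dedupe    # "formal<TAB>12345678" is mangled into "for5678"
printf '%s\n' "$line" | field_dedupe    # "formal" and "12345678" survive intact
```

The regexp version also matches the key where it straddles the end of one field and the start of the next. The field loop only ever removes two complete adjacent fields.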
Upvotes: 1
Reputation: 290515
Your attempt was good; you probably have some problems with spaces/tabs. Also, you may want to use FS to make it easier to change:
awk 'BEGIN {FS=OFS="\t"} {gsub($2 FS $1, "")}1' file
|____________________^
So if you notice that the field separator is a different one, just change it in your BEGIN block and it will work fine.
$ cat a
a b b a stuff1 b a stuff2 stuff3 b a
$ awk '{gsub($2 FS $1, "")}1' a
a b stuff1 stuff2 stuff3
Upvotes: 4
Reputation: 26687
The only problem I can think of is that your input file is not delimited by tabs.
Test
$ echo "a b b a stuff1 b a stuff2 stuff3 b a" | awk '{gsub($2" "$1,"")}1'
a b stuff1 stuff2 stuff3
Upvotes: 0