Reputation: 75
I tried the command below, which keeps the first duplicate line and removes the rest of the duplicate lines. Is there an option in the awk command to remove the first matched duplicate line and keep the second matched line? A command other than awk is also okay. The input file can be up to 50 GB in size; I am testing now on a 12 GB file.
awk -F'|' '!a[$1]++'
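For reference, my understanding of why this keeps only the first occurrence (an annotated sketch of the same command; file stands for the input file):
awk -F'|' '
  # a[$1]++ evaluates to 0 (false) the first time a key is seen, so
  # !a[$1]++ is true and the line is printed. On every later occurrence
  # the counter is already > 0, so the line is skipped.
  !a[$1]++
' file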
Input File Content:
1|xxx|{name: "xyz"}
2|xxx|{name: "abcfgs"}
3|xxx|{name: "egg"}
4|xxx|{name: "eggrgg"}
5|xxx|{name: "xsdsyzsgngn"}
5|xxx|{name: "gbgnfxyz"}
6|xxx|{name: "xyz"}
7|xxx|{name: "xynfnfnnnz"}
7|xxx|{name: "bvbv"}
8|xxx|{name: "xyz"}
9|xxx|{name: "xyz"}
....
Output expected after processing the input file:
1|xxx|{name: "xyz"}
2|xxx|{name: "abcfgs"}
3|xxx|{name: "egg"}
4|xxx|{name: "eggrgg"}
5|xxx|{name: "gbgnfxyz"}
6|xxx|{name: "xyz"}
7|xxx|{name: "bvbv"}
8|xxx|{name: "xyz"}
9|xxx|{name: "xyz"}
....
EDIT
I tried the solutions below, provided by @RavinderSingh13 and @RomanPerekhrest respectively.
For the 12 GB input file, the solution below took 1 minute 20 seconds in the first run and 1 minute 46 seconds in the second run:
awk '
BEGIN{
  FS="|"
}
!a[$1]++{
  b[++count]=$1
}
{
  c[$1]=$0
}
END{
  for(i=1;i<=count;i++){
    print c[b[i]]
  }
}
' Inputfile > testawk.txt
For the 12 GB input file, the solution below took 2 minutes 31 seconds in the first run, 4 minutes 43 seconds in the second run, and 2 minutes in the third run:
awk -F'|' 'prev && $1 != prev{ print row }{ prev=$1; row=$0 }END{ print row }' Inputfile > testawk2.txt
Both solutions are working as expected. I will use one of them after doing a few more performance tests.
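For reference, timings like the above can be collected with the shell's built-in time (a sketch only; the exact invocation used is not shown here):
time awk -F'|' 'prev && $1 != prev{ print row }{ prev=$1; row=$0 }END{ print row }' Inputfile > testawk2.txt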
Upvotes: 2
Views: 147
Reputation: 92904
Efficiently, with an awk expression:
awk -F'|' 'prev && $1 != prev{ print row }{ prev=$1; row=$0 }END{ print row }' file
The "magic" is based on capturing each current record (efficiently overwriting it without constant accumulation) and performing analysis on next row.
Sample output:
1|xxx|{name: "xyz"}
2|xxx|{name: "abcfgs"}
3|xxx|{name: "egg"}
4|xxx|{name: "eggrgg"}
5|xxx|{name: "gbgnfxyz"}
6|xxx|{name: "xyz"}
7|xxx|{name: "bvbv"}
8|xxx|{name: "xyz"}
9|xxx|{name: "xyz"}
Upvotes: 2
Reputation: 142005
Reverse the file, then do a stable unique sort:
cat <<EOF |
1|xxx|{name: "xyz"}
2|xxx|{name: "abcfgs"}
3|xxx|{name: "egg"}
4|xxx|{name: "eggrgg"}
5|xxx|{name: "xsdsyzsgngn"}
5|xxx|{name: "gbgnfxyz"}
6|xxx|{name: "xyz"}
7|xxx|{name: "xynfnfnnnz"}
7|xxx|{name: "bvbv"}
8|xxx|{name: "xyz"}
9|xxx|{name: "xyz"}
EOF
tac | sort -s -t'|' -k1,1 -u
would output:
1|xxx|{name: "xyz"}
2|xxx|{name: "abcfgs"}
3|xxx|{name: "egg"}
4|xxx|{name: "eggrgg"}
5|xxx|{name: "gbgnfxyz"}
6|xxx|{name: "xyz"}
7|xxx|{name: "bvbv"}
8|xxx|{name: "xyz"}
9|xxx|{name: "xyz"}
tac is a GNU utility. Because your file is big, pass the filename to tac so it can read the file from the back, and use the -T, --temporary-directory=DIR option with sort to allow it to sort such big files (or not, if you have enough RAM).
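Put together for a large on-disk file, that would look something like this (a sketch; bigfile and /path/to/tmpdir are placeholders for your input file and a directory with enough free space):
# read the file backwards, then keep the first (i.e. originally last)
# line per key; -T tells sort where it may write its temporary files
tac bigfile | sort -s -t'|' -k1,1 -u -T /path/to/tmpdir > output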
Upvotes: 0
Reputation: 133770
1st solution: If you are not at all worried about the order of the lines in your output, then simply do:
awk 'BEGIN{FS="|"} {a[$1]=$0} END{for(i in a){print a[i]}}' Input_file
2nd solution: Adding one more solution, using awk with fewer arrays plus sort, in case you are worried about the order:
awk 'BEGIN{FS="|"} {a[$1]=$0} END{for(i in a){print a[i]}}' Input_file | sort -t'|' -k1
3rd solution: Could you please try the following, if you want the order of your output to be the same as shown in the Input_file.
awk '
BEGIN{
  FS="|"
}
!a[$1]++{
  # Remember the order in which each key was first seen.
  b[++count]=$1
}
{
  # Always keep the most recent line seen for this key.
  c[$1]=$0
}
END{
  # Print the last-seen line for each key, in first-seen order.
  for(i=1;i<=count;i++){
    print c[b[i]]
  }
}
' Input_file
Output will be as follows.
1|xxx|{name: "xyz"}
2|xxx|{name: "abcfgs"}
3|xxx|{name: "egg"}
4|xxx|{name: "eggrgg"}
5|xxx|{name: "gbgnfxyz"}
6|xxx|{name: "xyz"}
7|xxx|{name: "bvbv"}
8|xxx|{name: "xyz"}
9|xxx|{name: "xyz"}
Upvotes: 1
Reputation: 195269
This one-liner removes only the first duplicate (i.e. the 2nd occurrence of each key) from your file.
awk 'a[$1]++ !=1' file
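To spell out the condition (an annotated sketch of the same one-liner):
awk '
  # a[$1]++ yields the number of times $1 has been seen so far, then
  # increments it. The bare condition prints the line whenever that
  # pre-increment count is not 1, i.e. every occurrence except the
  # second one of a given key is kept.
  a[$1]++ != 1
' file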
Let's see an example:
kent$ cat f
1
2
3
2 <- should be removed
4
3 <- should be removed
5
6
7
8
9
2 <- should be kept
3 <- should be kept
10
kent$ awk 'a[$1]++ !=1' f
1
2
3
4
5
6
7
8
9
2
3
10
Upvotes: 0