inzero

Reputation: 75

remove the first duplicate line based on a matched field and keep the second matched line

  1. The input file has 3 fields, separated by | (pipe).
  2. The first field is the key field and the file is sorted on it. Each key in the first field may occur once or twice.
  3. If the same key occurs twice in the first field, remove the line of the first occurrence and keep the line of the second occurrence.
  4. If a key occurs only once, do not remove the line.
  5. Data in the third field is unique throughout the file.

I tried the command below, which keeps the first occurrence of a duplicate key and removes the rest. Is there an awk option to remove the first matched duplicate line and keep the second matched line instead? A command other than awk is also fine. The input file can be up to 50 GB; I am currently testing on a 12 GB file.

awk -F'|' '!a[$1]++'

Input File Content:

1|xxx|{name: "xyz"}
2|xxx|{name: "abcfgs"}
3|xxx|{name: "egg"}
4|xxx|{name: "eggrgg"}
5|xxx|{name: "xsdsyzsgngn"}
5|xxx|{name: "gbgnfxyz"}
6|xxx|{name: "xyz"}
7|xxx|{name: "xynfnfnnnz"}
7|xxx|{name: "bvbv"}
8|xxx|{name: "xyz"}
9|xxx|{name: "xyz"}
....

Output expected after processing the input file:

1|xxx|{name: "xyz"}
2|xxx|{name: "abcfgs"}
3|xxx|{name: "egg"}
4|xxx|{name: "eggrgg"}
5|xxx|{name: "gbgnfxyz"}
6|xxx|{name: "xyz"}
7|xxx|{name: "bvbv"}
8|xxx|{name: "xyz"}
9|xxx|{name: "xyz"}
....

EDIT

I tried the solutions below, provided by @RavinderSingh13 and @RomanPerekhrest respectively.

For the 12 GB input file, the solution below took 1 minute 20 seconds on the first run and 1 minute 46 seconds on the second run:

awk '
BEGIN{
  FS="|"
}
!a[$1]++{
  b[++count]=$1
}
{
  c[$1]=$0
}
END{
  for(i=1;i<=count;i++){
    print c[b[i]]
  }
}
' Inputfile  > testawk.txt

For the 12 GB input file, the solution below took 2 minutes 31 seconds on the first run, 4 minutes 43 seconds on the second run and 2 minutes on the third run:

awk -F'|' 'prev && $1 != prev{ print row }{ prev=$1; row=$0 }END{ print row }' Inputfile > testawk2.txt

Both solutions work as expected. I will pick one of them after a few more performance tests.
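
For reference, a two-pass variant that keeps only per-key counters in memory instead of whole lines is sketched below (not benchmarked on the 12 GB file; the trade-off is that the input is read twice, and the output file name is just a placeholder):

awk -F'|' '
  NR==FNR { cnt[$1]++; next }    # pass 1: count how often each key occurs
  cnt[$1]==1 || seen[$1]++       # pass 2: skip the first line of a duplicated key
' Inputfile Inputfile > testawk3.txt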

Upvotes: 2

Views: 147

Answers (4)

RomanPerekhrest

Reputation: 92904

Efficiently, with an awk expression:

awk -F'|' 'prev && $1 != prev{ print row }{ prev=$1; row=$0 }END{ print row }' file

The "magic" is based on capturing each current record (efficiently overwriting it without constant accumulation) and performing analysis on next row.

Sample output:

1|xxx|{name: "xyz"}
2|xxx|{name: "abcfgs"}
3|xxx|{name: "egg"}
4|xxx|{name: "eggrgg"}
5|xxx|{name: "gbgnfxyz"}
6|xxx|{name: "xyz"}
7|xxx|{name: "bvbv"}
8|xxx|{name: "xyz"}
9|xxx|{name: "xyz"}

Upvotes: 2

KamilCuk

Reputation: 142005

Reverse the file, then do a stable, unique sort:

cat <<EOF |
1|xxx|{name: "xyz"}
2|xxx|{name: "abcfgs"}
3|xxx|{name: "egg"}
4|xxx|{name: "eggrgg"}
5|xxx|{name: "xsdsyzsgngn"}
5|xxx|{name: "gbgnfxyz"}
6|xxx|{name: "xyz"}
7|xxx|{name: "xynfnfnnnz"}
7|xxx|{name: "bvbv"}
8|xxx|{name: "xyz"}
9|xxx|{name: "xyz"}
EOF
tac | sort -s -t'|' -k1,1 -u

would output:

1|xxx|{name: "xyz"}
2|xxx|{name: "abcfgs"}
3|xxx|{name: "egg"}
4|xxx|{name: "eggrgg"}
5|xxx|{name: "gbgnfxyz"}
6|xxx|{name: "xyz"}
7|xxx|{name: "bvbv"}
8|xxx|{name: "xyz"}
9|xxx|{name: "xyz"}

tac is a GNU utility. Because your file is big, pass the filename to tac directly so it can read the file from the back, and use the -T, --temporary-directory=DIR option of sort so it can spill to disk while sorting such big files (or skip it, if you have enough RAM).
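
On the real file that would look roughly like this (Inputfile and /big/tmp are placeholders for your input file and a scratch directory with enough free space):

tac Inputfile | sort -s -t'|' -k1,1 -u -T /big/tmp > output.txt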

Upvotes: 0

RavinderSingh13

Reputation: 133770

1st solution: If you are not worried about the order of lines in the output at all, simply do:

awk 'BEGIN{FS="|"} {a[$1]=$0} END{for(i in a){print a[i]}}' Input_file


2nd solution: One more solution using fewer awk arrays plus sort, in case you are worried about the order:

awk 'BEGIN{FS="|"} {a[$1]=$0} END{for(i in a){print a[i]}}' Input_file | sort -t'|' -k1


3rd solution: Could you please try the following, if you want the order of your output to be the same as in the Input_file:

awk '
BEGIN{
  FS="|"
}
!a[$1]++{
  b[++count]=$1
}
{
  c[$1]=$0
}
END{
  for(i=1;i<=count;i++){
    print c[b[i]]
  }
}
'  Input_file

Output will be as follows.

1|xxx|{name: "xyz"}
2|xxx|{name: "abcfgs"}
3|xxx|{name: "egg"}
4|xxx|{name: "eggrgg"}
5|xxx|{name: "gbgnfxyz"}
6|xxx|{name: "xyz"}
7|xxx|{name: "bvbv"}
8|xxx|{name: "xyz"}
9|xxx|{name: "xyz"}

Upvotes: 1

Kent

Reputation: 195269

This one-liner removes only the first duplicate (the 2nd occurrence of each key) from your file.

awk 'a[$1]++ !=1' file

Let's see an example:

kent$  cat f
1
2
3
2 <- should be removed
4
3 <- should be removed
5
6
7
8
9
2 <- should be kept
3 <- should be kept
10

kent$  awk 'a[$1]++ !=1' f
1
2
3
4
5
6
7
8
9
2
3
10
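
The trick is the post-increment: a[$1]++ yields the old count before adding 1, so the pattern evaluates to 0 != 1 (true, printed) on the first occurrence, 1 != 1 (false, skipped) on the second, and true again from the third occurrence onwards. Spelled out with comments:

awk '
  # a[$1] holds how many times this key has been seen so far;
  # the post-increment returns that count and then adds 1,
  # so only the second occurrence of a key makes the pattern false
  a[$1]++ != 1
' f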

Upvotes: 0
