Kalpana Pinninty

Reputation: 59

Awk or sed commands to remove duplicates from a CSV file

I have a generated CSV file that contains duplicate values. I would like to delete/remove those duplicates using awk or sed commands.

Actual output

10.135.83.48,9042
10.135.83.46,9042
10.135.83.44,9042
10.5.197.25,10334
10.39.8.166,1500
10.135.83.48,9042
10.135.83.46,9042
10.135.83.44,9042
https://t-mobile.com,443
https://t-mobile.com,443
http://localhost:5059/abc/token,80

Expected output

10.135.83.48,9042
10.135.83.46,9042
10.135.83.44,9042
10.5.197.25,10334
10.39.8.166,1500
https://t-mobile.com,443
http://localhost:5059/abc/token,80

I got this output from a few property files. Below is the script I am trying:

#!/bin/bash
for file in $(ls); 
do 
#echo  " --$file -- "; 
grep -P  '((?<=[^0-9.]|^)[1-9][0-9]{0,2}(\.([0-9]{0,3})){3}(?=[^0-9.]|$)|(http|ftp|https|ftps|sftp)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/+#-]*[\w@?^=%&/+#-])?|\.port|\.host|contact-points|\.uri|\.endpoint)' $file|grep '^[^#]' |awk '{split($0,a,"#"); print a[1]}'|awk '{split($0,a,"="); print a[1],a[2]}'|sed 's/^\|#/,/g'|awk '/http:\/\//  {print $2,80}
       /https:\/\// {print $2,443}
       /Points/     {print $2,"9042"}
       /host/       {h=$2}
       /port/       {print h,$2; h=""}'|awk -F'[, ]' '{for(i=1;i<NF;i++){print $i,$NF}}'|awk 'BEGIN{OFS=","} {$1=$1} 1'|sed '/^[0-9]*$/d'|awk -F, '$1 != $2' 
done |awk '!a[$0]++' 
#echo "Done."
stty echo
cd ..

awk '!a[$0]++' --> This is the command I am trying to combine with the script above. On its own it works as expected, but when I combine it with the script it does not.
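
For reference, a minimal run of the one-liner by itself (dupes is just an illustrative name for a file containing the records above):

# a[$0]++ returns the count before the increment, so the expression is
# true (and the default print fires) only the first time a record is
# seen; repeated records see a non-zero count and are suppressed.
awk '!a[$0]++' dupes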

Thanks for your help in advance.

Upvotes: 1

Views: 927

Answers (3)

potong

Reputation: 58371

This might work for you (GNU sed):

sed -E 'H;x;s/((\n[^\n]+)(\n.*)*)\2$/\1/;x;$!d;x;s/.//' file1

Append the current line to the hold space (HS); if it duplicates a line already collected there, remove the newly appended copy.

At the end of the file, swap to the HS, remove the first character (the leftover newline from the first append) and print the result.

N.B. This removes duplicates but retains original order.
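
For readability, here is the same command spread over multiple lines with comments (a sketch of the mechanics; GNU sed accepts # comment lines inside a script):

sed -E '
# append the current line to the hold space (preceded by a newline)
H
# swap: the accumulated lines move into the pattern space
x
# if the just-appended line matches an earlier line, delete the new copy
s/((\n[^\n]+)(\n.*)*)\2$/\1/
# swap back: the unique collection returns to the hold space
x
# on every line but the last, print nothing and start the next cycle
$!d
# on the last line, fetch the collected unique lines
x
# strip the leading newline introduced by the very first H
s/.//
' file1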

Upvotes: 1

David C. Rankin

Reputation: 84531

The simplest way to approach this (or one of the simplest) is to keep an array indexed by the records that have been seen. If the record isn't in the seen array, add it and print the record. If it is, just skip the record, e.g.

awk '$0 in seen{next}; {seen[$0]++}1' file

Example Use/Output

With your input in the file named dupes, you would have:

$ awk '$0 in seen{next}; {seen[$0]++}1' dupes
10.135.83.48,9042
10.135.83.46,9042
10.135.83.44,9042
10.5.197.25,10334
10.39.8.166,1500
https://t-mobile.com,443
http://localhost:5059/abc/token,80
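
The same test-and-mark logic is often collapsed into a single expression, relying on the post-increment returning the count before the update:

awk '!seen[$0]++' dupes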

Upvotes: 1

Digvijay S

Reputation: 2705

Try

#!/bin/bash
for file in *; 
do 
#echo  " --$file -- "; 
grep -P  '((?<=[^0-9.]|^)[1-9][0-9]{0,2}(\.([0-9]{0,3})){3}(?=[^0-9.]|$)|(http|ftp|https|ftps|sftp)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/+#-]*[\w@?^=%&/+#-])?|\.port|\.host|contact-points|\.uri|\.endpoint)' $file|grep '^[^#]' |awk '{split($0,a,"#"); print a[1]}'|awk '{split($0,a,"="); print a[1],a[2]}'|sed 's/^\|#/,/g'|awk '/http:\/\//  {print $2,80}
       /https:\/\// {print $2,443}
       /Points/     {print $2,"9042"}
       /host/       {h=$2}
       /port/       {print h,$2; h=""}'|awk -F'[, ]' '{for(i=1;i<NF;i++){print $i,$NF}}'|awk 'BEGIN{OFS=","} {$1=$1} 1'|sed '/^[0-9]*$/d'|awk -F, '$1 != $2' | awk '!a[$0]++'  
done 
#echo "Done."
stty echo
cd ..
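
Two changes relative to the script in the question: the loop iterates over the glob * instead of $(ls), which avoids parsing ls output, and awk '!a[$0]++' is appended inside the loop, so each file's records are deduplicated as they are produced. Note this removes duplicates within each file's output; if duplicates across files also need removing, the filter would have to sit after done as in the question.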

Upvotes: 1
