Reputation: 2581
I want to remove duplicate words/strings from a large tab-separated file using Linux commands.
names john, cnn, mac, tommy, mac, patrick, ngc, discovery, john, cnn, adam, patrick
cities san jose, santa clara, san franscisco, new york, san jose, santa clara
That is the file format. I want to retain the tabs and commas after removing the duplicate words, so the output should be:
names john, cnn, mac, tommy, patrick, ngc, discovery, adam
cities san jose, santa clara, san franscisco, new york
Any help would be appreciated.
Upvotes: 3
Views: 1233
Reputation: 58578
This might work for you:
sed -i ':a;s/\(\(\<[^,]*\),.*\)\( \2,*\)/\1/;ta;s/,$//' /tmp/a
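In case the one-liner is opaque: the :a label and ta branch rerun the substitution until no later " word" copy remains on the line, and the final s/,$// trims a comma left behind when the removed copy was last. A quick demonstration on the question's sample (GNU sed; /tmp/a is the file name from the command above, and the printf lines just rebuild the sample input with a tab after the first field):

```shell
# Build the sample file (tab after the first field).
printf 'names\tjohn, cnn, mac, tommy, mac, patrick, ngc, discovery, john, cnn, adam, patrick\n'  > /tmp/a
printf 'cities\tsan jose, santa clara, san franscisco, new york, san jose, santa clara\n'       >> /tmp/a

# Loop (:a ... ta): each pass deletes one later duplicate of an
# already-seen word; s/,$// then drops any trailing comma.
sed -i ':a;s/\(\(\<[^,]*\),.*\)\( \2,*\)/\1/;ta;s/,$//' /tmp/a
cat /tmp/a
```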
Upvotes: 0
Reputation: 360733
awk 'BEGIN {
    FS = ", |\t"
}
{
    printf "%s\t", $1
    delim = ""
    for (i = 2; i <= NF; i++) {
        if (!($i in seen)) {
            printf "%s%s", delim, $i
            delim = ", "
        }
        seen[$i]
    }
    printf "\n"
    delete seen
}' inputfile
If you're not using GNU AWK (gawk), you can't delete a whole array at once; use split("", array) instead.
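For example, a portable version of the same script: only the last statement of the body changes, and the printf lines simply rebuild the question's sample input with a tab after the first field.

```shell
# Sample input matching the question (tab after the first field).
printf 'names\tjohn, cnn, mac, tommy, mac, patrick, ngc, discovery, john, cnn, adam, patrick\n'  > inputfile
printf 'cities\tsan jose, santa clara, san franscisco, new york, san jose, santa clara\n'       >> inputfile

# Same dedup logic as above; split("", seen) empties the array in
# any POSIX awk, not just gawk.
awk 'BEGIN {
    FS = ", |\t"
}
{
    printf "%s\t", $1
    delim = ""
    for (i = 2; i <= NF; i++) {
        if (!($i in seen)) {
            printf "%s%s", delim, $i
            delim = ", "
        }
        seen[$i]
    }
    printf "\n"
    split("", seen)
}' inputfile
```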
Upvotes: 3
Reputation: 35505
sed and awk by themselves aren't particularly well suited for this. uniq is better.
First pull out the names into another file, say names. You can use sed for this:
head -1 inputfile | sed 's/^names\s*//g' > names
So now names contains john, cnn, mac, tommy, mac, patrick, ngc, discovery, john, cnn, adam, patrick.
Then use this:
awk 'BEGIN{RS=", *|\n"}{print $0}' names | sort | uniq | awk 'BEGIN{ORS=","}{print $0}'
(The ", *|\n" record separator needs GNU awk's regex RS; it also swallows the space after each comma, so sort and uniq compare the bare names.)
Output is adam,cnn,discovery,john,mac,ngc,patrick,tommy, (you can remove the trailing comma with sed if you want). Of course you can pipe the output of the head command straight into that pipeline; in that case, you won't need the intermediate names file.
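For instance, the no-intermediate-file form could look like this (a sketch: the ", *|\n" record separator assumes GNU awk, and the printf line only rebuilds the question's sample input with a tab after "names"):

```shell
# Build the sample input (tab after "names").
printf 'names\tjohn, cnn, mac, tommy, mac, patrick, ngc, discovery, john, cnn, adam, patrick\n' > inputfile

# head feeds the first line straight into the pipeline; the regex
# record separator splits on ", " or newline, so each name lands on
# its own line for sort | uniq to deduplicate.
head -1 inputfile | sed 's/^names\s*//' | awk 'BEGIN{RS=", *|\n"}{print $0}' | sort | uniq | awk 'BEGIN{ORS=","}{print $0}'
```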
Same for cities. I am assuming order is not important for you.
Upvotes: 2