Reputation: 2581
I want to remove duplicate words/strings from a large tab-separated file using Linux commands.
names john, cnn, mac, tommy, mac, patrick, ngc, discovery, john, cnn, adam, patrick
cities san jose, santa clara, san franscisco, new york, san jose, santa clara
That is the file format. I want to retain the tabs and commas after removing the duplicate words, so the output should be:
names john, cnn, mac, tommy, patrick, ngc, discovery, adam
cities san jose, santa clara, san franscisco, new york
Any help would be appreciated.
Upvotes: 3
Views: 1233
Reputation: 58578
This might work for you:
sed -i ':a;s/\(\(\<[^,]*\),.*\)\( \2,*\)/\1/;ta;s/,$//' /tmp/a
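In case the one-liner is opaque: the :a label and ta branch rerun the substitution until no later " word" copy remains on the line, and the final s/,$// trims a comma left behind when the removed copy was last. A quick demonstration on the question's sample (GNU sed; /tmp/a is the file name from the command above, and the printf lines just rebuild the sample input with a tab after the first field):

```shell
# Build the sample file (tab after the first field).
printf 'names\tjohn, cnn, mac, tommy, mac, patrick, ngc, discovery, john, cnn, adam, patrick\n'  > /tmp/a
printf 'cities\tsan jose, santa clara, san franscisco, new york, san jose, santa clara\n'       >> /tmp/a

# Loop (:a ... ta): each pass deletes one later duplicate of an
# already-seen word; s/,$// then drops any trailing comma.
sed -i ':a;s/\(\(\<[^,]*\),.*\)\( \2,*\)/\1/;ta;s/,$//' /tmp/a
cat /tmp/a
```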
Upvotes: 0
Reputation: 360733
awk 'BEGIN {
    FS = ", |\t"
}
{
    printf "%s\t", $1
    delim = ""
    for (i = 2; i <= NF; i++) {
        if (!($i in seen)) {
            printf "%s%s", delim, $i
            delim = ", "
        }
        seen[$i]
    }
    printf "\n"
    delete seen
}' inputfile
If you're not using GNU AWK (gawk), you can't delete a whole array at once; use split("", array) instead.
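For example, a portable version of the same script: only the last statement of the body changes, and the printf lines simply rebuild the question's sample input with a tab after the first field.

```shell
# Sample input matching the question (tab after the first field).
printf 'names\tjohn, cnn, mac, tommy, mac, patrick, ngc, discovery, john, cnn, adam, patrick\n'  > inputfile
printf 'cities\tsan jose, santa clara, san franscisco, new york, san jose, santa clara\n'       >> inputfile

# Same dedup logic as above; split("", seen) empties the array in
# any POSIX awk, not just gawk.
awk 'BEGIN {
    FS = ", |\t"
}
{
    printf "%s\t", $1
    delim = ""
    for (i = 2; i <= NF; i++) {
        if (!($i in seen)) {
            printf "%s%s", delim, $i
            delim = ", "
        }
        seen[$i]
    }
    printf "\n"
    split("", seen)
}' inputfile
```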
Upvotes: 3
Reputation: 35505
sed and awk by themselves aren't particularly well suited for this. uniq is better.
First pull out the names into another file, say names. You can use sed for this:
head -1 inputfile | sed 's/^names\s*//g' > names
So now names contains john, cnn, mac, tommy, mac, patrick, ngc, discovery, john, cnn, adam, patrick.
Then use this:
awk 'BEGIN{RS=", *|\n"}{print $0}' names | sort | uniq | awk 'BEGIN{ORS=","}{print $0}'
(The ", *|\n" record separator needs GNU awk's regex RS; it also swallows the space after each comma, so sort and uniq compare the bare names.)
Output is adam,cnn,discovery,john,mac,ngc,patrick,tommy, (you can remove the trailing comma with sed if you want). Of course you can pipe the output of the head command straight into that pipeline; in that case, you won't need the intermediate names file.
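For instance, the no-intermediate-file form could look like this (a sketch: the ", *|\n" record separator assumes GNU awk, and the printf line only rebuilds the question's sample input with a tab after "names"):

```shell
# Build the sample input (tab after "names").
printf 'names\tjohn, cnn, mac, tommy, mac, patrick, ngc, discovery, john, cnn, adam, patrick\n' > inputfile

# head feeds the first line straight into the pipeline; the regex
# record separator splits on ", " or newline, so each name lands on
# its own line for sort | uniq to deduplicate.
head -1 inputfile | sed 's/^names\s*//' | awk 'BEGIN{RS=", *|\n"}{print $0}' | sort | uniq | awk 'BEGIN{ORS=","}{print $0}'
```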
Same for cities. I am assuming order is not important for you.
Upvotes: 2