Reputation: 105
I have a string containing duplicate words, for example:
abc, def, abc, def
How can I remove the duplicates? The string that I need is:
abc, def
Upvotes: 7
Views: 8025
Reputation: 314
The problem with an associative array or with xargs and sort in the other examples is that the words end up sorted. My solution only skips words that have already been processed; the associative array map keeps track of them.
Bash function
function uniq_words() {
    local string="$1"
    local delimiter=", "
    local words=""
    declare -A map
    while read -r word; do
        # skip already processed words
        if [ -n "${map[$word]}" ]; then
            continue
        fi
        # mark the word as seen
        map[$word]=1
        # don't add a delimiter before the first word
        if [ -z "$words" ]; then
            words=$word
            continue
        fi
        # add a delimiter and the word
        words="$words$delimiter$word"
    # split the string into lines so that we don't have
    # to overwrite the $IFS system field separator
    done <<< "$(sed -e "s/$delimiter/\n/g" <<< "$string")"
    echo "$words"
}
Example 1
uniq_words "abc, def, abc, def"
Output:
abc, def
Example 2
uniq_words "1, 2, 3, 2, 1, 0"
Output:
1, 2, 3, 0
Example with xargs and sort
In this example, the output is sorted.
echo "1 2 3 2 1 0" | xargs -n1 | sort -u | xargs | sed "s# #, #g"
Output:
0, 1, 2, 3
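As a side note (a sketch, not part of the original answer): the same pipeline shape can preserve input order if the sort -u stage is replaced by awk's classic !seen[$0]++ dedup idiom, which prints each line only the first time it appears:

```shell
# order-preserving dedup: awk keeps only the first occurrence of each word
echo "1 2 3 2 1 0" | xargs -n1 | awk '!seen[$0]++' | xargs | sed "s# #, #g"
# -> 1, 2, 3, 0
```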
Upvotes: 0
Reputation: 20843
This can also be done in pure Bash:
#!/bin/bash
string="abc, def, abc, def"
declare -A words
IFS=", "
for w in $string; do
    words+=( [$w]="" )
done
echo "${!words[@]}"
Output
def abc
Explanation
words is an associative array (declare -A words), and every word is added as a key to it: words+=( [${w}]="" ). (We do not need its value, so I have taken "" as the value.)
The list of unique words is the list of keys (${!words[@]}).
There is one caveat though: the output is not separated by ", ". (You would have to iterate again; IFS is only used with ${words[*]}, and even then only the first character of IFS is used.)
Upvotes: 2
Reputation: 105
I have another way for this case. I changed my input string as shown below and ran this command to edit it:
#string="abc def abc def"
$ echo "abc def abc def" | xargs -n1 | sort -u | xargs | sed "s# #, #g"
abc, def
Thanks for all the support!
Upvotes: 1
Reputation: 22428
You can use awk to do this. (Note that the RT variable used below is a GNU awk extension.)
Example:
#!/bin/bash
string="abc, def, abc, def"
string=$(printf '%s\n' "$string" | awk -v RS='[,[:space:]]+' '!a[$0]++{printf "%s%s", $0, RT}')
string="${string%,*}"
echo "$string"
Output:
abc, def
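Since RT requires GNU awk, here is a rough, more portable sketch (an assumption of mine, not from the answer: it takes the delimiter to be exactly ", "). It splits each line into fields on the delimiter and prints a field only the first time it is seen:

```shell
# portable variant: ", " as field separator, classic !a[$i]++ dedup idiom
echo "abc, def, abc, def" |
    awk -F', ' '{ sep = ""
                  for (i = 1; i <= NF; i++)
                      if (!a[$i]++) { printf "%s%s", sep, $i; sep = ", " }
                  print "" }'
# -> abc, def
```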
Upvotes: 3
Reputation: 113834
We have this test file:
$ cat file
abc, def, abc, def
To remove duplicate words:
$ sed -r ':a; s/\b([[:alnum:]]+)\b(.*)\b\1\b/\1\2/g; ta; s/(, )+/, /g; s/, *$//' file
abc, def
:a
This defines a label a.
s/\b([[:alnum:]]+)\b(.*)\b\1\b/\1\2/g
This looks for a duplicated word consisting of alphanumeric characters and removes the second occurrence.
ta
If the last substitution command resulted in a change, this jumps back to label a to try again. In this way, the code keeps looking for duplicates until none remain.
s/(, )+/, /g; s/, *$//
These two substitution commands clean up any left-over comma-space combinations.
For macOS or other BSD systems, try:
sed -E -e ':a' -e 's/\b([[:alnum:]]+)\b(.*)\b\1\b/\1\2/g' -e 'ta' -e 's/(, )+/, /g' -e 's/, *$//' file
sed easily handles input either from a file, as shown above, or from a shell string as shown below:
$ echo 'ab, cd, cd, ab, ef' | sed -r ':a; s/\b([[:alnum:]]+)\b(.*)\b\1\b/\1\2/g; ta; s/(, )+/, /g; s/, *$//'
ab, cd, ef
Upvotes: 8