Thanh Tran

Reputation: 105

How to remove duplicate words from a string in a Bash script?

I have a string containing duplicate words, for example:

abc, def, abc, def

How can I remove the duplicates? The string that I need is:

abc, def

Upvotes: 7

Views: 8025

Answers (5)

Rabbit

Reputation: 314

The problem with an associative array, or with xargs and sort, in the other examples is that the original word order is not preserved. My solution only skips words that have already been processed; the associative array map keeps track of this.

Bash function

function uniq_words() {

  local string="$1"
  local delimiter=", "  
  local words=""

  declare -A map

  while read -r word; do
    # skip already processed words
    if [ ! -z "${map[$word]}" ]; then
      continue
    fi

    # mark the found word
    map[$word]=1

    # don't add a delimiter, if it is the first word
    if [ -z "$words" ]; then
      words=$word
      continue
    fi

    # add a delimiter and the word
    words="$words$delimiter$word"

  # split the string into lines so that we don't have
  # to overwrite the $IFS system field separator
  done <<< "$(sed -e "s/$delimiter/\n/g" <<< "$string")"

  echo "${words}"
}

Example 1

uniq_words "abc, def, abc, def"

Output:

abc, def

Example 2

uniq_words "1, 2, 3, 2, 1, 0"

Output:

1, 2, 3, 0

Example with xargs and sort

In this example, the output is sorted.

echo "1 2 3 2 1 0" | xargs -n1 | sort -u | xargs |  sed "s# #, #g"

Output:

0, 1, 2, 3

Upvotes: 0

Micha Wiedenmann

Reputation: 20843

This can also be done in pure Bash:

#!/bin/bash

string="abc, def, abc, def"

declare -A words

IFS=", "
for w in $string; do
  words+=( [$w]="" )
done

echo ${!words[@]}

Output

def abc

Explanation

words is an associative array (declare -A words) and every word is added as a key to it:

words+=( [$w]="" )

(We do not need its value, so I have used "" as the value.)

The list of unique words is the list of keys (${!words[@]}).

There is one caveat though: the output is not separated by ", ". (You will have to iterate again; IFS is only applied with ${words[*]}, and even then only the first character of IFS is used.)
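
If you do need the ", " separator, one option is a second loop over the keys. This is a minimal sketch building on the words array above (the result variable name is mine, not part of the answer); note that the keys of a Bash associative array come back in no particular order:

# join the unique keys with ", "
result=""
for w in "${!words[@]}"; do
  if [ -z "$result" ]; then
    result=$w
  else
    result="$result, $w"
  fi
done
echo "$result"

With the sample string this prints something like def, abc.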

Upvotes: 2

Thanh Tran

Reputation: 105

I have another way for this case. I changed my input string as shown below and ran a command to edit it:

#string="abc def abc def"
$ echo "abc def abc def" | xargs -n1 | sort -u | xargs |  sed "s# #, #g"
abc, def

Thanks for all the support!

Upvotes: 1

Jahid

Reputation: 22428

You can use awk to do this.

Example:

#!/bin/bash
string="abc, def, abc, def"
# Split records on runs of commas/whitespace (RS). Print each record only the
# first time it is seen, followed by the separator that terminated it
# (RT is specific to GNU awk).
string=$(printf '%s\n' "$string" | awk -v RS='[,[:space:]]+' '!a[$0]++{printf "%s%s", $0, RT}')
# Remove the trailing ", " left behind after the last unique record.
string="${string%,*}"
echo "$string"

Output:

abc, def

Upvotes: 3

John1024

Reputation: 113834

We have this test file:

$ cat file
abc, def, abc, def

To remove duplicate words:

$ sed -r ':a; s/\b([[:alnum:]]+)\b(.*)\b\1\b/\1\2/g; ta; s/(, )+/, /g; s/, *$//' file
abc, def

How it works

  • :a

    This defines a label a.

  • s/\b([[:alnum:]]+)\b(.*)\b\1\b/\1\2/g

    This looks for a duplicated word consisting of alphanumeric characters and removes the second occurrence.

  • ta

    If the last substitution command resulted in a change, this jumps back to label a to try again.

    In this way, the code keeps looking for duplicates until none remain.

  • s/(, )+/, /g; s/, *$//

    These two substitution commands clean up any leftover comma-space combinations, as the sketch below illustrates.
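
To see why that cleanup is needed, it helps to run only the loop part on the sample string (GNU sed assumed); removing the later occurrences leaves their delimiters behind, so the output should look something like:

$ echo 'abc, def, abc, def' | sed -r ':a; s/\b([[:alnum:]]+)\b(.*)\b\1\b/\1\2/g; ta'
abc, def, , 

The stray ", , " and the trailing separator are exactly what the last two substitutions remove.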

Mac OS X or other BSD systems

For Mac OS X or other BSD systems, try:

sed -E -e ':a' -e 's/\b([[:alnum:]]+)\b(.*)\b\1\b/\1\2/g' -e 'ta' -e 's/(, )+/, /g' -e 's/, *$//' file

Using a string instead of a file

sed easily handles input either from a file, as shown above, or from a shell string as shown below:

$ echo 'ab, cd, cd, ab, ef' | sed -r ':a; s/\b([[:alnum:]]+)\b(.*)\b\1\b/\1\2/g; ta; s/(, )+/, /g; s/, *$//'
ab, cd, ef

Upvotes: 8
