Reputation: 694

How can I display unique words contained in a Bash string?

I have a string that has duplicate words. I would like to display only the unique words. The string is:

variable="alpha bravo charlie alpha delta echo charlie"

I know several tools that can do this together. This is what I figured out:

echo $variable | tr " " "\n" | sort -u | tr "\n" " "

What is a more effective way to do this?

Upvotes: 13

Answers (8)

FGrose

Reputation: 51

With sed:

Pad the input string with a space before and after (" $word_bag " below).
Repeatedly remove duplicates: (\S+) captures a word as \2, and a later whitespace-delimited occurrence of \2 is deleted.
Remove the padding.

word_set=$(sed -E ':w s/(\s(\S+)\s.*)\2\s/\1/;tw; s/^\s+//; s/\s+$//' <<< " $word_bag ")

(Does not scale to very long inputs.)
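
For anyone wanting to try the command above as-is, here is a minimal run against the question's sample string. It assumes GNU sed, since \s and \S are GNU extensions, and it reuses the word_bag and word_set names from the command.

```shell
#!/usr/bin/env bash
# Sample input taken from the question; word_bag is the name the sed command expects.
word_bag="alpha bravo charlie alpha delta echo charlie"

# Assumes GNU sed (\s and \S are GNU extensions).
# The loop at label w deletes one later duplicate per pass until none is left,
# then the two trailing substitutions strip the padding spaces.
word_set=$(sed -E ':w s/(\s(\S+)\s.*)\2\s/\1/;tw; s/^\s+//; s/\s+$//' <<< " $word_bag ")

printf '%s\n' "$word_set"
```

The first occurrence of each word is kept, so the input order is preserved.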

Upvotes: 0

mklement0

Reputation: 437478

Note: This solution assumes that all unique words should be output in the order they're encountered in the input. By contrast, the OP's own solution attempt outputs a sorted list of unique words.

A simple Awk-only solution (POSIX-compliant) that is efficient by avoiding a pipeline (which invariably involves subshells).

awk -v RS=' ' '{ if (!seen[$1]++) { printf "%s%s",sep,$1; sep=" " } }' <<<"$variable"

# The above prints without a trailing \n, as in the OP's own solution.
# To add a trailing newline, append `END { print "" }` to the end
# of the Awk script (a bare `print` would output the last record again).

Note how $variable is double-quoted to prevent it from accidental shell expansions, notably pathname expansion (globbing), and how it is provided to Awk via a here-string (<<<).
-v RS=' ' tells Awk to split the input into records by a single space.
- Note that the last word will have the input line's trailing newline included, which is why we don't use $0 - the entire record - but $1, the record's first field, which has the newline stripped due to Awk's default field-splitting behavior.
seen[$1]++ is a common Awk idiom that either creates an entry for $1, the input word, in associative array seen, if it doesn't exist yet, or increments its occurrence count.
!seen[$1]++ therefore only returns true for the first occurrence of a given word (where seen[$1] is implicitly zero/the empty string; the ++ is a post-increment, and therefore doesn't take effect until after the condition is evaluated).
{printf "%s%s",sep,$1; sep=" "} prints the word at hand $1, preceded by separator sep, which is implicitly the empty string for the first word, but a single space for subsequent words, due to setting sep to " " immediately after.

Here's a more flexible variant that handles any run of whitespace between input words; it works with GNU Awk and Mawk [1]:

awk -v RS='[[:space:]]+' '{if (!seen[$0]++){printf "%s%s",sep,$0; sep=" "}}' <<<"$variable"

-v RS='[[:space:]]+' tells Awk to split the input into records by any mix of spaces, tabs, and newlines.

[1] Unfortunately, BSD/OSX Awk (in strict compliance with the POSIX spec) doesn't support using regular expressions or even multi-character literals as RS, the input record separator.
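
If the whitespace-tolerant behavior is needed on an Awk without regex RS support (such as the BSD/OSX one mentioned in the footnote), one possible workaround is to leave RS alone and loop over the fields of each line instead. This is only a sketch; the function name unique_words_awk is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the unique words of $1 in first-seen order,
# space-separated, followed by a newline.
unique_words_awk() {
  # Default field splitting already treats any run of blanks as one separator,
  # so no regex RS is required.
  awk '{
         for (i = 1; i <= NF; i++)
           if (!seen[$i]++) { printf "%s%s", sep, $i; sep = " " }
       }
       END { print "" }' <<<"$1"
}

unique_words_awk "alpha bravo charlie alpha delta echo charlie"
```

The seen array and the sep trick are the same as in the commands above; only the way the words are enumerated differs.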

Upvotes: 4

anubhava

Reputation: 785068

Using associative arrays in BASH 4+ you can simplify this:

variable="alpha bravo charlie alpha delta echo charlie"

# declare an associative array
declare -A unq

# read sentence into an indexed array
read -ra arr <<< "$variable"

# iterate each word and populate associative array with word as key
for w in "${arr[@]}"; do
   unq["$w"]=1
done

# print unique results
printf "%s\n" "${!unq[@]}"
delta
bravo
echo
alpha
charlie

## if you want results in same order as original string
for w in "${arr[@]}"; do
   [[ ${unq["$w"]} ]] && echo "$w" && unset 'unq[$w]'
done
alpha
bravo
charlie
delta
echo
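
The two loops above can also be folded into a single pass. Here is a rough sketch wrapped in a function; the name unique_words and its local variables are my own choices, and it assumes Bash 4+ and a single-line input (read -ra only reads the first line).

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the unique words of $1 in first-seen order on one line.
# Requires Bash 4+ for associative arrays.
unique_words() {
  local -A seen=()      # words already encountered
  local -a words out=() # input words, and the de-duplicated result
  local w

  read -ra words <<< "$1"
  for w in "${words[@]}"; do
    # ${seen[$w]+x} expands to nothing unless the key is already set.
    if [[ -z ${seen[$w]+x} ]]; then
      seen[$w]=1
      out+=("$w")
    fi
  done

  # "${out[*]}" joins the elements with the first character of IFS (a space by default).
  printf '%s\n' "${out[*]}"
}

unique_words "alpha bravo charlie alpha delta echo charlie"
```

Collecting the words in an indexed array keeps the original order without a second loop or any unset.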

Upvotes: 1

Todd A. Jacobs

Reputation: 84343

Use a Bash Substitution Expansion

The following shell parameter expansion will substitute spaces with newlines, and then pass the results into the sort utility to return only the unique words.

$ echo -e "${variable// /\\n}" | sort -u
alpha
bravo
charlie
delta
echo

This has the side effect of sorting your words: sort -u sorts its input in order to detect duplicates, and uniq only removes adjacent duplicates, so it needs sorted input too. If that's not what you want, I also posted a Ruby solution that preserves the original word order.

Rejoining Words

If, as one commenter pointed out, you're trying to reassemble your unique words back into a single line, you can use command substitution to do this. For example:

$ echo $(echo -e "${variable// /\\n}" | sort -u)
alpha bravo charlie delta echo

The lack of quotes around the command substitution is intentional. If you quote it, the newlines will be preserved because Bash won't do word-splitting. Unquoted, the shell will return the results as a single line, however unintuitive that may seem.
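
If relying on unquoted word-splitting feels too fragile, the rejoining step can probably be done with paste instead. The following is a sketch rather than a drop-in replacement; the function name sorted_unique is hypothetical, and it swaps the parameter expansion for tr so that tabs and repeated spaces are handled too.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the sorted unique words of $1 on a single line.
sorted_unique() {
  # tr -s turns every run of whitespace into a single newline,
  # sort -u sorts and de-duplicates,
  # paste -s joins the resulting lines using a space as the delimiter.
  tr -s '[:space:]' '\n' <<< "$1" | sort -u | paste -sd ' ' -
}

variable="alpha bravo charlie alpha delta echo charlie"
sorted_unique "$variable"
```

Note that leading whitespace in the input would produce an empty first line, which this sketch does not filter out.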

Upvotes: 11

evil otto

Reputation: 10582

pure, ugly bash:

for x in $variable; do
    if [ "$(eval echo $(echo \$un__$x))" = "" ]; then
         echo -n "$x "
         eval un__$x=1
         __usv="$__usv un__$x"
    fi
done
unset $__usv

Upvotes: -1

dawg

Reputation: 103814

You can use awk:

$ echo "$variable" | awk '{for(i=1;i<=NF;i++){if (!seen[$i]++) printf "%s ", $i}}'
alpha bravo charlie delta echo

If you do not want the trailing space and want a trailing newline, you can do:

$ echo "$variable" | awk 'BEGIN{j=""} {for(i=1;i<=NF;i++){if (!seen[$i]++) j = (j=="" ? $i : j" "$i)}} END{print j}'
alpha bravo charlie delta echo

Upvotes: 1

Todd A. Jacobs

Reputation: 84343

Preserve Input Order with a Ruby One-Liner

I posted a Bash-specific answer already, but if you want to return only unique words while preserving the word order of the original string, then you can use the following Ruby one-liner:

$ echo "$variable" | ruby -ne 'puts $_.split.uniq'
alpha
bravo
charlie
delta
echo

This will split the input string on whitespace, and then return unique elements from the resulting array.

Unlike the sort or uniq utilities, Ruby doesn't need the words to be sorted to detect duplicates. This may be a better solution if you don't want your results to be sorted, although it makes no practical difference for the posted example, whose unique words already appear in sorted order.

Rejoining Words

If, as one commenter pointed out, you're then trying to reassemble the words back into a single line after deduplication, you can do that too. For that, we just append the Array#join method:

$ echo "$variable" | ruby -ne 'puts $_.split.uniq.join(" ")'
alpha bravo charlie delta echo

Upvotes: 2

jyvet

Reputation: 2191

You may use xargs:

echo "$variable" | xargs -n 1 | sort -u | xargs

Upvotes: 7
