david8146

Reputation: 1

Generating word counts while sorting by order of appearance in bash

I am trying to list the words in a file in the order in which they first appear (I am only interested in certain words in the file), so that the first word to appear is at the top of the output and the last is at the bottom.

The usual way to generate a word count, with sort | uniq -c, discards the order of appearance. How can I generate this frequency count without losing that ordering?

Sample text file:

Godard conscious aioli Ouija Aeolus victorious furious perfect family twelve silver seven promiscuous radioactive one you Thursday heart Ate pizza contagious near princess ion water ace igneous ambitious

Sample output:

1 conscious 
1 aioli 
1 Ouija 
1 Aeolus 
1 victorious 
1 furious 
1 promiscuous 
1 radioactive 
1 contagious 
1 igneous 
1 ambitious

Upvotes: 0

Views: 287

Answers (5)

ghoti

Reputation: 46836

I thought I should get in on this too.

Here's a one-liner, just for Charles:

gawk -v RS="[[:space:]]+" '{$0=tolower($0)} /[aeiou]{3}/ && !($0 in p) {p[$0]=n++} /[aeiou]{3}/ {a[p[$0]]=$0;c[p[$0]]++} END { for (i=0;i<n;i++) printf "%6d %s\n",c[i],a[i] }' input.txt

Broken out for easier reading (and commenting):

#!/usr/bin/env -S gawk -f

BEGIN {
  RS="[[:space:]]+"               # Set a reasonable record separator
}                                 # (includes spaces and newlines)

{
  $0=tolower($0)                  # ignore case...
}

/[aeiou]{3}/ && !($0 in p) {      # if we've found a word, make sure
  p[$0]=n++                       # we have a pointer to it.
}

/[aeiou]{3}/ {                    # if we've found a word and have a pointer,
  a[p[$0]]=$0                     # make a record of the word,
  c[p[$0]]++                      # and increment its counter.
}

END {                             # Once everything's been processed,
  for (i=0;i<n;i++)               # step through our list, and
    printf "%6d %s\n",c[i],a[i]   # print the results.
}

This covers multiple forms of whitespace, counts accurately, and keeps words in order. Oh, and it does this in a single pass.
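
For systems without gawk, roughly the same single-pass idea can be sketched in POSIX awk by looping over fields instead of relying on a regex RS. The function name count_vowel_words and the sample input are made up for illustration, and the interval expression {3} is spelled out because some older awks lack it:

```shell
# Hypothetical helper: count words containing three consecutive vowels,
# in order of first appearance. Reads text on stdin.
count_vowel_words() {
  awk '
    {
      for (i = 1; i <= NF; i++) {             # default field splitting covers spaces and tabs
        w = tolower($i)                       # ignore case, as in the script above
        if (w ~ /[aeiou][aeiou][aeiou]/) {
          if (!(w in count)) order[n++] = w   # first sighting fixes the position
          count[w]++
        }
      }
    }
    END {
      for (i = 0; i < n; i++) printf "%6d %s\n", count[order[i]], order[i]
    }
  '
}

# Made-up sample input with repeats and mixed case
printf 'aioli Ouija pizza\naioli queue OUIJA aioli\n' | count_vowel_words
```

With this sample input, it should report aioli (3), ouija (2) and queue (1), in that order.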

Upvotes: 1

sjsam

Reputation: 21965

Consider a more realistic input, in which some words are repeated:

cat txt1

Godard conscious aioli Ouija Aeolus victorious furious perfect family twelve silver seven promiscuous radioactive one you Thursday ambitious heart Ate pizza contagious near princess ion water ace ambitious igneous ambitious conscious

The awk script below would do the trick:

 awk '!c[$0]++ {v[i++]=$0}    # count every word; remember order on first sighting
  END{
  for(j=0;j<i;j++){print c[v[j]],v[j]}
  }' <(awk -v RS=' +|\n' '$0 ~ /(.*[aAeEiIoOuU].*){3}/' txt1)

Output

2 conscious
1 aioli
1 Ouija
1 Aeolus
1 victorious
1 furious
1 promiscuous
1 radioactive
3 ambitious
1 contagious
1 igneous

Upvotes: 0

karakfa

Reputation: 67467

awk to the rescue!

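A double scan of the file gets the counts: the first pass tallies the matching words, and the second prints each one at its first appearance (file{,} is bash brace expansion for "file file").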

$ awk -v RS=' +|\n' 'NR==FNR {t=$0; if(gsub(/[aeiou]/,"")>2) a[t]++; next} 
                     $0 in a {print a[$0],$0; delete a[$0]}' file{,}
1 conscious
1 aioli
1 Ouija
1 Aeolus
1 victorious
1 furious
1 promiscuous
1 radioactive
1 contagious
1 igneous
1 ambitious

If the filtered word list is produced by some other means, this pipeline generates the counts while preserving the input order:

 $ awk -v RS=' +|\n' '{t=$0} gsub(/[aeiou]/,"")>2{print t}' file | 
   # or some other means to generate filtered words ...
   cat -n        |     # add line number
   sort -k2 -k1n |     # sort by words and line number
   uniq -f1 -c   |     # find counts skipping line number
   sort -k2n     |     # sort by original line number
   awk '{print $1,$3}' # remove the line number
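
The same decorate, sort, count, re-sort pattern can be wrapped in a function and tried on input with repeats. This is only a sketch: the name count_in_order and the word list are made up, numbering is done with awk rather than cat -n, and LC_ALL=C is an added assumption to keep the sort deterministic across locales.

```shell
# Hypothetical helper: reads one word per line on stdin and prints
# "count word" in order of first appearance.
count_in_order() {
  tab=$(printf '\t')
  awk '{ print NR "\t" $0 }' |             # decorate: line number, tab, word
  LC_ALL=C sort -t "$tab" -k2,2 -k1,1n |   # group by word, earliest line first
  uniq -f1 -c |                            # count, ignoring the line-number field
  sort -k2,2n |                            # restore order of first appearance
  awk '{ print $1, $3 }'                   # keep count and word, drop the line number
}

# Made-up input: already filtered, one word per line
printf '%s\n' conscious aioli ambitious aioli conscious ambitious ambitious |
  count_in_order
```

Because uniq keeps the first line of each group, and each group is ordered by line number, the surviving line number is always that of the word's first appearance. For the input above this should give 2 conscious, 2 aioli, 3 ambitious.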

Upvotes: 3

Charles Duffy

Reputation: 295353

Part One: Extracting Matching Words

The following command:

s='Godard conscious aioli Ouija Aeolus victorious furious perfect family twelve silver seven promiscuous radioactive one you Thursday heart Ate pizza contagious near princess ion water ace igneous ambitious'
tr '[:space:]' '\n' <<<"$s" | grep -Ei '[aeiou].*[aeiou].*[aeiou]'

...generates the output:

conscious
aioli
Ouija
Aeolus
victorious
furious
promiscuous
radioactive
contagious
igneous
ambitious

...which properly contains the subset of words with at least three vowels, in their original order of appearance.
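
Note that this pattern accepts three vowels anywhere in a word, whereas a pattern such as [aeiou]{3} requires them to be adjacent. The two agree on the sample text, but not in general. A small comparison, with made-up function names and test words:

```shell
# Hypothetical filters: both read one word per line on stdin.
any_three_vowels() { grep -Ei '[aeiou].*[aeiou].*[aeiou]'; }   # three vowels anywhere
three_in_a_row()   { grep -Ei '[aeiou]{3}'; }                  # three adjacent vowels

# Made-up test words
printf '%s\n' banana radio aioli queue Ouija rhythm | any_three_vowels
echo ---
printf '%s\n' banana radio aioli queue Ouija rhythm | three_in_a_row
```

The first filter should keep banana, radio, aioli, queue and Ouija; the second keeps only aioli, queue and Ouija. Which one is wanted depends on how "certain words" is defined in the question.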


Part Two: Adding A Counter While Maintaining Sort Order

To maintain a counter requires either maintaining state or doing multiple passes.

#!/usr/bin/env bash
if [[ -z $BASH_VERSION ]] || [[ $BASH_VERSION = [1-3].* ]]; then
  echo "ERROR: This requires bash 4.0 or newer" >&2
  exit 1
fi

### Begin code from Part 1
s='Godard conscious aioli Ouija Aeolus victorious furious perfect family twelve silver seven promiscuous radioactive one you Thursday heart Ate pizza contagious near princess ion water ace igneous ambitious'
get_words() { tr '[:space:]' '\n' <<<"$s" | grep -Ei '[aeiou].*[aeiou].*[aeiou]'; }
### End code from Part 1

declare -a var_order=( )
declare -A var_count=( )
while IFS= read -r var; do
  if (( ${var_count[$var]} )); then
    var_count[$var]=$(( ${var_count[$var]} + 1 ))
  else
    var_order+=( "$var" )
    var_count[$var]=1
  fi
done < <(get_words)

for var in "${var_order[@]}"; do
  printf '% -4d %s\n' "${var_count[$var]}" "$var"
done

...which properly generates the output:

 1   conscious
 1   aioli
 1   Ouija
 1   Aeolus
 1   victorious
 1   furious
 1   promiscuous
 1   radioactive
 1   contagious
 1   igneous
 1   ambitious

Upvotes: 2

glenn jackman

Reputation: 246774

In plain bash, you could do:

set -f
shopt -s nocasematch
for word in $(< words.txt); do 
    [[ $word == *[aeiou][aeiou][aeiou]* ]] && printf '%s\n' "$word"
done

That only prints the words with three consecutive vowels; it does not count them.

Upvotes: 0
