Reputation: 1
I am trying to sort a file of words in the order they appear in the file (I am only interested in certain words in the file). The first word to appear should be at the top of the output and the last at the bottom.
The usual way to generate a word count, with sort | uniq -c, eliminates sort order. How can I generate this frequency count without losing that ordering?
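To make the problem concrete, here is a small sketch. The three-word input and the function name naive_count are made up for illustration, and LC_ALL=C is set only so the sort order is predictable:

```shell
# Hypothetical helper: the usual frequency-count pipeline.
naive_count() {
    LC_ALL=C sort | uniq -c
}

# "conscious" appears first in the input, but "aioli" is listed first
# in the output because sort reorders the words alphabetically.
printf '%s\n' conscious aioli conscious | naive_count
```

The counts are right, but the order of first appearance is gone.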
Sample text file:
Godard conscious aioli Ouija Aeolus victorious furious perfect family twelve silver seven promiscuous radioactive one you Thursday heart Ate pizza contagious near princess ion water ace igneous ambitious
Sample output:
1 conscious
1 aioli
1 Ouija
1 Aeolus
1 victorious
1 furious
1 promiscuous
1 radioactive
1 contagious
1 igneous
1 ambitious
Upvotes: 0
Views: 287
Reputation: 46836
I thought I should get in on this too.
Here's a one-liner, just for Charles:
gawk -v RS="[[:space:]]+" '{$0=tolower($0)} /[aeiou]{3}/ && !($0 in p) {p[$0]=n++} /[aeiou]{3}/ {a[p[$0]]=$0;c[p[$0]]++} END { for (i=0;i<n;i++) printf "%6d %s\n",c[i],a[i] }' input.txt
Broken out for easier reading (and commenting):
#!/usr/bin/gawk -f
# Path assumed; "env gawk -f" in a shebang fails on Linux, which passes "gawk -f" as a single argument.

BEGIN {
    RS="[[:space:]]+"              # Set a reasonable record separator
}                                  # (includes spaces and newlines)

{
    $0=tolower($0)                 # ignore case...
}

/[aeiou]{3}/ && !($0 in p) {       # if we've found a word, make sure
    p[$0]=n++                      # we have a pointer to it.
}

/[aeiou]{3}/ {                     # if we've found a word and have a pointer,
    a[p[$0]]=$0                    # make a record of the word,
    c[p[$0]]++                     # and increment its counter.
}

END {                              # Once everything's been processed,
    for (i=0;i<n;i++)              # step through our list, and
        printf "%6d %s\n",c[i],a[i]   # print the results.
}
This covers multiple forms of whitespace, counts accurately, and keeps words in order. Oh, and it does this in a single pass.
Upvotes: 1
Reputation: 21965
Considering a more realistic input, in which some words repeat:
cat txt1
Godard conscious aioli Ouija Aeolus victorious furious perfect family twelve silver seven promiscuous radioactive one you Thursday ambitious heart Ate pizza contagious near princess ion water ace ambitious igneous ambitious conscious
the awk script below would do the trick:
awk 'NR==FNR {v[i++]=$0;c[$0]++;next}END{
for(j=0;j<i;j++){if(p[v[j]]==0){print c[v[j]],v[j]}
p[v[j]]=c[v[j]]>1?1:0;}
}' <(awk -v RS=' +|\n' '$0 ~ /(.*[aAeEiIoOuU].*){3}/' txt1)
Output
2 conscious
1 aioli
1 Ouija
1 Aeolus
1 victorious
1 furious
1 promiscuous
1 radioactive
3 ambitious
1 contagious
1 igneous
Upvotes: 0
Reputation: 67467
awk to the rescue!
A double scan of the file gets the counts:
$ awk -v RS=' +|\n' 'NR==FNR {t=$0; if(gsub(/[aeiou]/,"")>2) a[t]++; next}
$0 in a {print a[$0],$0; delete a[$0]}' file{,}
1 conscious
1 aioli
1 Ouija
1 Aeolus
1 victorious
1 furious
1 promiscuous
1 radioactive
1 contagious
1 igneous
1 ambitious
If the filtered word list is extracted by some other means, this pipeline will generate the counts while keeping the input order:
$ awk -v RS=' +|\n' '{t=$0} gsub(/[aeiou]/,"")>2{print t}' file |
# or some other means to generate filtered words ...
cat -n | # add line number
sort -k2 -k1n | # sort by words and line number
uniq -f1 -c | # find counts skipping line number
sort -k2n | # sort by original line number
awk '{print $1,$3}' # remove the line number
Upvotes: 3
Reputation: 295353
The following command:
s='Godard conscious aioli Ouija Aeolus victorious furious perfect family twelve silver seven promiscuous radioactive one you Thursday heart Ate pizza contagious near princess ion water ace igneous ambitious'
tr '[:space:]' '\n' <<<"$s" | grep -Ei '[aeiou].*[aeiou].*[aeiou]'
...generates the output:
conscious
aioli
Ouija
Aeolus
victorious
furious
promiscuous
radioactive
contagious
igneous
ambitious
...which properly contains the subset of words with at least three vowels, in their original order of appearance.
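Note that this pattern matches three vowels anywhere in a word, which is looser than a pattern requiring three vowels in a row, such as [aeiou]{3}. Both select the same words from the sample text, but they can differ on other input. A minimal sketch of the difference follows; the function names and the test word banana are illustrative and not part of the original question:

```shell
# Succeeds if the word contains at least three vowels anywhere.
has_three_vowels() {
    printf '%s\n' "$1" | grep -Eiq '[aeiou].*[aeiou].*[aeiou]'
}

# Succeeds only if three vowels appear consecutively.
has_three_consecutive_vowels() {
    printf '%s\n' "$1" | grep -Eiq '[aeiou]{3}'
}

# "Ouija" passes both checks; "banana" passes only the first.
for word in Ouija banana; do
    if has_three_vowels "$word"; then anywhere=yes; else anywhere=no; fi
    if has_three_consecutive_vowels "$word"; then consecutive=yes; else consecutive=no; fi
    printf '%s: anywhere=%s consecutive=%s\n' "$word" "$anywhere" "$consecutive"
done
```

Which pattern is right depends on what "certain words" means for your data.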
Maintaining a counter requires either keeping state or making multiple passes. The following script keeps state in bash 4 arrays:
#!/usr/bin/env bash
if [[ -z $BASH_VERSION ]] || [[ $BASH_VERSION = [1-3].* ]]; then
    echo "ERROR: This requires bash 4.0 or newer" >&2
    exit 1
fi

### Begin code from Part 1
s='Godard conscious aioli Ouija Aeolus victorious furious perfect family twelve silver seven promiscuous radioactive one you Thursday heart Ate pizza contagious near princess ion water ace igneous ambitious'
get_words() { tr '[:space:]' '\n' <<<"$s" | grep -Ei '[aeiou].*[aeiou].*[aeiou]'; }
### End code from Part 1

declare -a var_order=( )   # words, in order of first appearance
declare -A var_count=( )   # word -> number of occurrences

while IFS= read -r var; do
    if (( ${var_count[$var]:-0} )); then
        var_count[$var]=$(( ${var_count[$var]} + 1 ))
    else
        var_order+=( "$var" )
        var_count[$var]=1
    fi
done < <(get_words)

for var in "${var_order[@]}"; do
    printf '% -4d %s\n' "${var_count[$var]}" "$var"
done
...which properly generates the output:
1 conscious
1 aioli
1 Ouija
1 Aeolus
1 victorious
1 furious
1 promiscuous
1 radioactive
1 contagious
1 igneous
1 ambitious
Upvotes: 2
Reputation: 246774
In plain bash, you could do:
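set -f                   # disable globbing so words are not expanded as patterns
shopt -s nocasematch     # make [[ == ]] pattern matching case-insensitive
for word in $(< words.txt); do
    [[ $word == *[aeiou][aeiou][aeiou]* ]] && echo "$word"
done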
That just prints the words with three consecutive vowels; it does not count them.
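To add counts while keeping first-appearance order, the loop's output could be piped through a small awk filter. This is only a sketch: the function name count_first_seen is made up, it assumes one word per line on standard input, and it sticks to POSIX awk features.

```shell
# Hypothetical helper: count duplicate lines, printing each distinct line
# once, in the order it was first seen (unlike sort | uniq -c).
count_first_seen() {
    awk '
        !($0 in count) { order[n++] = $0 }   # remember first-seen position
        { count[$0]++ }                      # tally every occurrence
        END {
            for (i = 0; i < n; i++)
                print count[order[i]], order[i]
        }
    '
}

# Usage with made-up input: prints "2 igneous" and then "1 aioli".
printf '%s\n' igneous aioli igneous | count_first_seen
```

With the loop above, that would mean ending it with done | count_first_seen. Words that differ only in case are counted separately unless they are lowercased first.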
Upvotes: 0