Wanmi Siangshai

Reputation: 142

Pattern matching in if statement in bash

I'm trying to count the words with at least two vowels in all the .txt files in the directory. Here's my code so far:

#!/bin/bash

wordcount=0


for i in $HOME/*.txt
do
cat $i |
while read line
do
    for w in $line
    do
    if [[ $w == .*[aeiouAEIOU].*[AEIOUaeiou].* ]]
    then
        wordcount=`expr $wordcount + 1`
        echo $w ':' $wordcount  
    else
        echo "In else"
    fi
    done
done
echo $i ':' $wordcount
wordcount=0
done

Here is my sample from a txt file

Last modified: Sun Aug 20 18:18:27 IST 2017
To remove PPAs
sudo apt-get install ppa-purge
sudo ppa-purge ppa:

The problem is that the pattern in the if statement doesn't match any of the words in the text file; every word goes straight to the else branch. Secondly, the wordcount printed by echo $i ':' $wordcount is always 0, when it should be the number of matching words in that file.

Upvotes: 4

Views: 11720

Answers (3)

nagendra547

Reputation: 6330

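With grep, this is pretty simple to do:

#!/bin/bash

wordcount=0
for file in ./*.txt
do
    # Put each word on its own line, then count the lines with two or more vowels.
    # (tr is used rather than xargs -n1, which fails on words containing quotes.)
    count=$(tr -s '[:space:]' '\n' < "$file" | grep -ci '[aeiou].*[aeiou]')
    wordcount=$((wordcount + count))
done

# Total across all files
echo "$wordcount"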

Upvotes: 0

Charles Duffy

Reputation: 295845

Immediate Issue: Glob vs Regex

[[ $string = $pattern ]] doesn't perform regex matching; instead, it's a glob-style pattern match. While . means "any character" in regex, it matches only itself in glob.

You have a few options here:

  1. Use =~ instead to perform regular expression matching:

    [[ $w =~ .*[aeiouAEIOU].*[AEIOUaeiou].* ]]
    
  2. Use a glob-style expression instead of a regex:

    [[ $w = *[aeiouAEIOU]*[aeiouAEIOU]* ]]
    

    Note the use of = rather than == here. Either is valid inside [[ ]], but sticking to = avoids building muscle memory that leads to bugs when writing code for a POSIX implementation of test / [, where = is the only valid string comparison operator.
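
To make the difference concrete, here is a minimal sketch that runs the three forms against a few words. The function names (matches_glob, matches_regex, matches_regex_as_glob) are made up for this illustration and do not come from the question.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; the names are not from the question.

# Glob match: * means "any run of characters".
matches_glob() {
  [[ $1 = *[aeiouAEIOU]*[aeiouAEIOU]* ]]
}

# Regex match: =~ treats the right-hand side as an extended regular expression.
matches_regex() {
  [[ $1 =~ [aeiouAEIOU].*[aeiouAEIOU] ]]
}

# The original test: a regex-looking pattern handed to the glob matcher,
# where each "." only matches a literal dot.
matches_regex_as_glob() {
  [[ $1 == .*[aeiouAEIOU].*[AEIOUaeiou].* ]]
}

for word in install sky .a.e.; do
  for check in matches_glob matches_regex matches_regex_as_glob; do
    if "$check" "$word"; then result=yes; else result=no; fi
    printf '%-8s %-22s %s\n' "$word" "$check" "$result"
  done
done
```

Of the three sample words, only the one made of literal dots and vowels satisfies the original pattern, which is why every real word in the text file fell through to the else branch.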


Larger Issue: Properly Reading Word-By-Word

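Using for w in $line is innately unsafe: the unquoted expansion undergoes glob expansion as well as word splitting, so a word such as * is replaced with a list of filenames. Use read -a to read a line into an array of words: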

#!/usr/bin/env bash

wordcount=0
for i in "$HOME"/*.txt; do
  while read -r -a words; do
    for word in "${words[@]}"; do
      if [[ $word = *[aeiouAEIOU]*[aeiouAEIOU]* ]]; then
        (( ++wordcount ))
      fi
    done
  done <"$i"
  printf '%s: %s\n' "$i" "$wordcount"
  wordcount=0
done
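
Reading with done <"$i" also addresses the second symptom in the question, the count printing as 0. In cat $i | while ..., bash runs the while loop in a subshell (unless the lastpipe option is in effect), so changes to wordcount are lost when the pipeline ends. A minimal sketch of the difference, using hypothetical function names:

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration only.

# Counts input lines inside a pipeline: the loop runs in a subshell,
# so the increments never reach this function's copy of n.
count_in_pipeline() {
  local n=0 line
  printf '%s\n' "$@" | while read -r line; do (( ++n )); done
  echo "$n"
}

# Counts input lines via redirection: the loop runs in the current shell,
# so n keeps its value after the loop ends.
count_with_redirect() {
  local n=0 line
  while read -r line; do (( ++n )); done < <(printf '%s\n' "$@")
  echo "$n"
}

count_in_pipeline   one two three
count_with_redirect one two three
```

The same reasoning applies to a file: redirecting it into the loop keeps the counter in the shell that later prints it.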

Upvotes: 7

John1024

Reputation: 113994

Try:

awk '/[aeiouAEIOU].*[AEIOUaeiou]/{n++} ENDFILE{print FILENAME":"n; n=0}' RS='[[:space:]]' *.txt

Sample output looks like:

$ awk '/[aeiouAEIOU].*[AEIOUaeiou]/{n++} ENDFILE{print FILENAME":"n; n=0}' RS='[[:space:]]' *.txt
one.txt:1
sample.txt:9

How it works:

  • /[aeiouAEIOU].*[AEIOUaeiou]/{n++}

    Every time we find a word with two vowels, we increment variable n.

  • ENDFILE{print FILENAME":"n; n=0}

    At the end of each file, we print the name of the file and the 2-vowel word count n. We then reset n to zero.

  • RS='[[:space:]]'

    This tells awk to use any whitespace as a word separator. This makes each word into a record. Awk reads the input one record at a time.
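
Note that ENDFILE is a GNU awk extension, and POSIX leaves the behavior of a multi-character RS unspecified. Where only a POSIX awk is available, one possible workaround is to run awk once per file and loop over the whitespace-separated fields instead. The function name count_two_vowel_words below is made up for this sketch:

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the number of words in file "$1" that contain
# at least two vowels, using only POSIX awk features.
count_two_vowel_words() {
  awk '
    {
      # Default field splitting already breaks each line on whitespace.
      for (i = 1; i <= NF; i++)
        if ($i ~ /[aeiouAEIOU].*[aeiouAEIOU]/) n++
    }
    END { print n + 0 }   # n + 0 prints 0 instead of an empty string
  ' "$1"
}

# Usage: one name:count line per file.
for f in ./*.txt; do
  [ -e "$f" ] || continue   # skip the literal pattern when nothing matches
  printf '%s:%s\n' "$f" "$(count_two_vowel_words "$f")"
done
```

This trades the single awk invocation for one process per file, which is usually fine for a handful of text files.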

Shell issues

The use of awk avoids a multitude of shell issues. For example, consider the line for w in $line. This will not work the way you hope. Consider a directory with these files:

$ ls
one.txt  sample.txt

Now, let's take line='* Item One' and see what happens:

$ line='* Item One'
$ for w in $line; do echo "w=$w"; done
w=one.txt
w=sample.txt
w=Item
w=One

The shell treats the * in line as a wildcard and expands it into a list of files. Odds are you didn't want this. The awk solution avoids a variety of issues like this.
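
If you do stay in bash, the expansion shown above can be avoided by letting read -a do the splitting, since read splits on whitespace without performing filename expansion. A minimal sketch, run in a scratch directory created just for the demonstration (split_words is a hypothetical helper name):

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints each whitespace-separated word of "$1" on its
# own line, without letting the shell expand wildcards.
split_words() {
  local -a words
  read -r -a words <<< "$1"
  printf 'w=%s\n' "${words[@]}"
}

# Demonstration in a throwaway directory containing two files.
demo_dir=$(mktemp -d)
touch "$demo_dir/one.txt" "$demo_dir/sample.txt"
(
  cd "$demo_dir" || exit 1
  split_words '* Item One'
)
rm -rf "$demo_dir"
```

Here the * comes through as a literal word even though the directory contains matching files.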

Upvotes: 1
