kayla210

Reputation: 106

Find all files that contain all words/lines in another file

I'm going to come right out and say this is a homework question, but I feel like I've exhausted my online search for anything related to this problem, or I'm just not wording it correctly for Google/Stack Overflow.

The question starts out like this: File words contains a list of words. Each word is on a separate line. Files story1, story2, ..., story100 are short stories.

It's a multi-part question, but the very last part is stumping me: Find the story files that contain all of the words in file words.

There was a question before it that's similar: Find the story files (print their file names) that contain at least one word from file words.

This one I solved by using grep:

grep -l -f words story*

I was under the impression that I would also have to use grep for the last problem, but I can't find a grep option (or anything else) that returns only those files that match every pattern in a pattern file. It appears I may have to do this with a shell script, but I'm unsure where to start or whether I even need grep for this. Any pointers on how to solve this problem?

Thanks!

EDIT:

These are the correct answers from the solution the instructor gave us.

Question before main question: grep -l -f words story*

Main question:

for story in `ls story*`
do
    (( match = 0 ))

    for word in `cat words`
    do
        if [ `grep -l $word $story` ]
        then
            (( match++ ))
        else
            break
        fi
    done

    if [ $match -eq `wc -w < words` ]
    then
        echo $story
    fi
done
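
For comparison, here is a more compact sketch of the same check (not part of the instructor's solution, so treat it as untested): it counts how many distinct words from the words file each story matches, using GNU grep's -o, and compares that with the number of distinct words.

total=$(sort -u words | wc -l)    # assumes one word per line, no blank lines
for story in story*
do
    # -o prints every match on its own line; -F -w match whole fixed words
    matched=$(grep -o -w -F -f words "$story" | sort -u | wc -l)
    if [ "$matched" -eq "$total" ]
    then
        echo "$story"
    fi
done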

Thanks, everyone, for your thoughtful input and answers, and sorry I'm a little late getting this posted.

Upvotes: 2

Views: 1705

Answers (5)

NeronLeVelu

Reputation: 10039

for EachFile in story*
do
    # split the story into one word per line, then de-duplicate the list
    sed 's/  */\
/g' "${EachFile}" | sort -u > /tmp/StoryInList

    # count the lines of "words" that are NOT found as whole words in the
    # story's word list; zero means the story contains every word
    if [ "$( fgrep -w -c -v -f /tmp/StoryInList words )" -eq 0 ]
    then
        echo "${EachFile}"
    fi
done
rm /tmp/StoryInList

A bit more code in a batch-style loop, but it does the job even with several thousand words, by leaning on grep's strength.
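
As a side note (my own tweak, not part of the answer above): using mktemp for the scratch file avoids clashes between concurrent runs, and tr is a simpler way to split a story into one word per line. A rough sketch:

tmp=$(mktemp) || exit 1
trap 'rm -f "$tmp"' EXIT

for EachFile in story*
do
    # one word per line, de-duplicated, with empty lines dropped
    tr -s '[:space:]' '\n' < "$EachFile" | sort -u | sed '/^$/d' > "$tmp"

    # zero unmatched lines in "words" means every word is present
    if [ "$(fgrep -w -c -v -f "$tmp" words)" -eq 0 ]
    then
        echo "$EachFile"
    fi
done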

Upvotes: 0

Ryan McConn

Reputation: 26

# wcheck: finds story* files that contain all words in words file

# for each file named story... (in this directory)
for file in story*
do
    stGood=0  # assume the story contains all the words (0 = true)

    ## for each word in the words file
    for word in $(cat words) ; do

        ## use grep's exit status to test whether the word exists in the file
        if ! grep -q -F "$word" "$file"
        then
            stGood=1  # word not found, so the story is marked false
            break
        fi   
    done
    ## if the story is still marked true, print its filename
    if [ "$stGood" -eq 0 ]
    then
        echo "$file"
    fi
done
exit

Upvotes: 1

Sylvain Leroux

Reputation: 52030

If you build a list of the unique words to search for, and for each story the list of unique words it contains, the problem becomes easier to solve with grep -c:

# remove duplicate words in a file
# place them one per line
function buildWordList() {
    sed -e 's/[^[:alpha:]][^[:alpha:]]*/'$"\n"'/g' "$1" |
           tr '[:upper:]' '[:lower:]' | sort -u | sed '/^$/d'
    #      ^^^^^^^^^^^^^^^^^^^^^^
    #      Works for English. 
}

TMP=$(mktemp -d)
trap "rm -rf $TMP" EXIT

buildWordList words | sed 's/.*/^&$/' > $TMP/Words
#                         ^^^^^^^^^^^
#                     force whole-word matching (as we have 1 word/line);
#                     `grep -w` might have been used below instead, but I
#                     don't know whether that option is GNU-specific.
count=$(wc -l < $TMP/Words)

for file in story*
 do
    # build a list of unique words in the story, one per line
    buildWordList "${file}" > $TMP/FileWords
    if [ $( grep -c -f $TMP/Words $TMP/FileWords ) -eq $count ]
     then
       echo "${file}"
     fi
 done

Upvotes: 0

Ed Morton

Reputation: 203995

Assuming your words file contains no RE metacharacters, and using GNU awk for \<...\> word boundaries:

To list files containing one word:

awk '
NR==FNR { words["\\<" $0 "\\>"]; next }
{
    for (word in words) {
        if ($0 ~ word) {
            print FILENAME
            nextfile    # filename already printed; skip to the next file
        }
    }
}
' words story*

To list files containing all words (GNU awk for additionally ENDFILE, delete(array) and length(array)):

awk '
NR==FNR { words["\\<" $0 "\\>"]; next }
{
    for (word in words) {
        if ($0 ~ word) {
            found[word]
        }
    }
}
ENDFILE {
    if ( length(found) == length(words) ) {
        print FILENAME
    }
    delete found
}
' words story*

Upvotes: 1

David C. Rankin

Reputation: 84579

The brute force method probably isn't the fastest way to do this, but as long as you don't have 100,000+ words and stories, it's fine. Basically, you will just test that each file contains each word using grep, one at a time. If a grep fails to find the word in story, move on to the next story. If all words are found in story, add story to a goodstories array. At the end, just print all goodstories:

#!/bin/bash

declare -a words        # array containing all words
declare -a goodstories  # array containing stories with all words

words=( `< /path/to/words` )    # fill words array

## for each stories file (assumed they exist in dir of their own)
for s in `find /path/to/stories/base/dir -type f` ; do

    wfound=0                    # all words found flag initialized to 'true'

    ## for each word in words
    for w in ${words[@]}; do

        ## test that word is in story, if not set wfound=1 break
        grep -q $w $s &>/dev/null || {

            wfound=1
            break

        }

    done

    ## if grep found all words, add story to goodstories array
    test "$wfound" -eq 0 && goodstories+=( $s )

done

## output your list of goodstories

if test "${#goodstories[@]}" -gt 0 ; then

    echo -e "\nStories that contained all words:\n"
    for s in ${goodstories[@]}; do

        echo "  $s"

    done

else

    echo "No stories contained all words"

fi

exit 0

NOTE: I didn't create a words or stories file, so if you find a typo, etc., consider the code as pseudo code. However, it wasn't just slapped together either...

Upvotes: 1
