Reputation: 106
I'm going to come right out and say that this is a homework question, but I feel like I've exhausted my online search for anything relating to this problem, or I'm just not wording it correctly for Google/Stack Overflow.
The question starts out like this: File words contains a list of words. Each word is on a separate line. Files story1, story2, ..., story100 are short stories.
It's a multi-part question, but the very last part is stumping me: find the story files that contain all of the words in file words.
There was a similar question before it: find the story files (print the file names) that contain at least one word from file words.
This one I solved by using grep:
grep -l -f words story*
I was under the impression that I would also have to use grep for the last part, but I can't find an option for grep (or anything else) that returns only those files that match every pattern in a pattern file. It appears I may have to do this with a shell script, but I'm unsure of where to start, or whether I even need grep for this. Any pointers on how to solve this problem?
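To make the setup concrete, here is a tiny invented example (the file contents are mine, not from the assignment):

$ cat words
river
stone
$ cat story1
The river ran past a stone bridge.
$ cat story2
A stone wall stood alone.
$ grep -l -f words story*
story1
story2

grep -l -f prints both files because each contains at least one word; what I'm after would print only story1, since it alone contains both river and stone.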
Thanks!
EDIT:
These are the correct answers from the solution the instructor gave us.
Question before main question: grep -l -f words story*
Main question:
for story in `ls story*`
do
    (( match = 0 ))
    for word in `cat words`
    do
        if [ `grep -l $word $story` ]
        then
            (( match++ ))
        else
            break
        fi
    done
    if [ $match -eq `wc -w < words` ]
    then
        echo $story
    fi
done
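For anyone comparing notes, the same idea can be written without the ls and cat subshells, reading words line by line so it also copes with odd characters; this is my own sketch, not part of the instructor's solution (grep -w does whole-word matching; it is widely supported, though not strictly POSIX):

for story in story*
do
    all=1
    while IFS= read -r word
    do
        # stop at the first word missing from this story
        grep -qw -- "$word" "$story" || { all=0; break; }
    done < words
    [ "$all" -eq 1 ] && printf '%s\n' "$story"
done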
Thanks, everyone, for the thoughtful input and answers, and sorry I'm a little late getting this out there.
Upvotes: 2
Views: 1705
Reputation: 10039
for EachFile in story*
do
    # split the story into one word per line (runs of one or more
    # spaces become newlines) and remove duplicates
    sed 's/  */\
/g' ${EachFile} | sort -u > /tmp/StoryInList
    # count the words from file words that are NOT in this story;
    # zero missing words means the story contains them all
    if [ $( fgrep -w -c -v -f /tmp/StoryInList words ) -eq 0 ]
    then
        echo ${EachFile}
    fi
done
rm /tmp/StoryInList
It's a bit of code for a batch job, but it does the job even with several thousand words, by leaning on grep's strength.
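One caveat, which is my observation rather than the answer's: the sed split only handles spaces, so punctuation stays glued to words and stone. will not match stone. A tr-based tokenizer sidesteps that (a hypothetical drop-in for the sed line above):

# turn every run of non-letters into a newline, then de-duplicate
tr -cs '[:alpha:]' '\n' < ${EachFile} | sort -u > /tmp/StoryInList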
Upvotes: 0
Reputation: 26
# wcheck: finds story* files that contain all words in the words file

# for each file named story... (in this directory)
for file in story*
do
    stGood=0    # story is initialized as containing all words (0 = true)
    ## for each word in the words file
    for word in $(cat words) ; do
        ## test grep's exit status for the existence of the word
        if ! grep -q -F $word $file
        then
            stGood=1    # if a word is not found, the story is set to false
            break
        fi
    done
    ## if the story is still true, its filename is printed
    if [ $stGood == 0 ]
    then
        echo $file
    fi
done
exit
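Saved as wcheck (the name in the header comment) and run with bash from the directory containing words and the story files, it prints one matching file name per line; the output below is invented:

$ bash wcheck
story3
story58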
Upvotes: 1
Reputation: 52030
If you have a list of unique words to search for, and for each story the list of unique words it contains, the problem is easier to solve using fgrep -c:
# remove duplicate words in a file,
# placing them one per line
function buildWordList() {
    sed -e 's/[^[:alpha:]][^[:alpha:]]*/'$'\n''/g' "$1" |
        tr '[:upper:]' '[:lower:]' | sort -u | sed '/^$/d'
        # the [:upper:]/[:lower:] lowercasing works for English
}

TMP=$(mktemp -d)
trap "rm -rf $TMP" EXIT

# anchor each word as ^word$ to force whole-word matching
# (as we have 1 word/line); `grep -w` might have been used
# instead below, but I don't know if that is GNU-specific
buildWordList words | sed 's/.*/^&$/' > $TMP/Words

count=$(wc -l < $TMP/Words)

for file in story*
do
    # build a list of unique words in the story, one per line
    buildWordList "${file}" > $TMP/FileWords
    if [ $( grep -c -f $TMP/Words $TMP/FileWords ) -eq $count ]
    then
        echo "${file}"
    fi
done
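As a quick sanity check on buildWordList, here is what it produces for an invented input (my example, not the answer's):

$ printf 'The quick brown Fox ran.\nThe FOX!\n' > /tmp/demo
$ buildWordList /tmp/demo
brown
fox
quick
ran
the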
Upvotes: 0
Reputation: 203995
Assuming your words file contains no RE metacharacters, and using GNU awk for \<...\> word boundaries:
To list files containing one word:
awk '
NR==FNR { words["\\<" $0 "\\>"]; next }
{
    for (word in words) {
        if ($0 ~ word) {
            print FILENAME
            next
        }
    }
}
' words story*
To list files containing all words (GNU awk again, additionally for ENDFILE, delete(array), and length(array)):
awk '
NR==FNR { words["\\<" $0 "\\>"]; next }
{
    for (word in words) {
        if ($0 ~ word) {
            found[word]
        }
    }
}
ENDFILE {
    if ( length(found) == length(words) ) {
        print FILENAME
    }
    delete found
}
' words story*
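The \< and \> operators are what make this match whole words: for a words line reading cat, the stored regex is \<cat\>, which matches cat but not concatenate. A quick demonstration with invented files (mine, not part of the answer):

$ printf 'cat\n' > words
$ printf 'the cat sat\n' > story1
$ printf 'concatenate here\n' > story2
$ awk 'NR==FNR { words["\\<" $0 "\\>"]; next }
       { for (w in words) if ($0 ~ w) { print FILENAME; next } }' words story*
story1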
Upvotes: 1
Reputation: 84579
The brute force method probably isn't the fastest way to do this, but as long as you don't have 100,000+ words and stories, it's fine. Basically, you just test, one word at a time with grep, that each file contains each word. If a grep fails to find a word in a story, move on to the next story. If all words are found in a story, add it to a goodstories array. At the end, print all the goodstories:
#!/bin/bash

declare -a words          # array containing all words
declare -a goodstories    # array containing stories with all words

words=( `< /path/to/words` )    # fill words array

## for each stories file (assumed they exist in a dir of their own)
for s in `find /path/to/stories/base/dir -type f` ; do
    wfound=0    # all-words-found flag initialized to 'true'
    ## for each word in words
    for w in ${words[@]}; do
        ## test that word is in story; if not, set wfound=1 and break
        grep -q $w $s &>/dev/null || {
            wfound=1
            break
        }
    done
    ## if grep found all words, add story to the goodstories array
    test "$wfound" -eq 0 && goodstories+=( $s )
done

## output your list of goodstories
if test "${#goodstories[@]}" -gt 0 ; then
    echo -e "\nStories that contained all words:\n"
    for s in ${goodstories[@]}; do
        echo " $s"
    done
else
    echo "No stories contained all words"
fi

exit 0
NOTE: I didn't create words or stories files, so if you find a typo, etc., consider the code as pseudo code. However, it wasn't just slapped together, either...
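For completeness, a run might look like this (script name, paths, and output all invented):

$ ./find_stories.sh

Stories that contained all words:

 /path/to/stories/base/dir/story12
 /path/to/stories/base/dir/story40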
Upvotes: 1