Searching and Counting Word Fragments in a text file

Question

I was tasked with writing a program that opens a text file, searches it for occurrences of a user-supplied string, and reports how many there were.

What I have so far is below. It finds word fragments, which is good, but the professor wants it to find bizarre fragments that include spaces and anything else: something like "of my" or "even g" or any other arbitrary string of characters.

My working code is below. I've been trying to make compareTo work, but I can't get the syntax right. The professor is not much help, and since it's a summer class there are no TAs to ask. I've googled to no avail; I can't seem to put the problem into a decent set of words to search for.

import java.io.File;
import java.io.FileNotFoundException;
import java.util.*;

import javax.swing.*;

public class TextSearchFromFile 
{
public static void main(String[] args) throws FileNotFoundException 
{

    boolean run = true;
    int count = 0;


            //greet user
        JOptionPane.showMessageDialog(null,
                "Hello, today you will be searching through a text file on the harddrive. \n"
                + "The Text File is a 300 page fantasy manuscript written by: Adam\n"
                + "This exercise was intended to have the user enter the file, but since \n"
                + "you, the user, don't know which file the text to search is that is a \n"
                + "bit difficult.\n\n"
                + "On the next window you will be prompted to enter a string of characters.\n"
                + "Feel free to enter that string and see if it is somewhere in 300 pages\n"
                + "and 102,133 words. Have fun.",
                "Text Search",
                JOptionPane.PLAIN_MESSAGE);

    while (run)
    {
        try
        {
                //open the file
            Scanner scanner = new Scanner(new File("An Everthrone Tale 1.txt"));

                //prompt user for word
            CharSequence findWord = JOptionPane.showInputDialog(null, 
                    "Enter the word to search for:", 
                    "Text Search", 
                    JOptionPane.PLAIN_MESSAGE);
            count = 0;


            while (scanner.hasNext())
            {

                if ((scanner.next()).contains(findWord))
                {
                    count++;
                }

            } //end search loop


                //output results to user
            JOptionPane.showMessageDialog(null,
                    "The results of your search are as follows: \n"
                    + "Your String: " + findWord + "\n"
                    + "Was found: " + count + " times.\n"
                    + "Within the file: An Ever Throne Tale 1.txt",
                    "Text Search",
                    JOptionPane.PLAIN_MESSAGE);
        } //end try
        catch (NullPointerException e)
        {
            JOptionPane.showMessageDialog(null, 
                    "Thank you for using the Text Search.", 
                    "Text Search", 
                    JOptionPane.ERROR_MESSAGE);
            System.exit(0);
        }
    } //end run loop
} // end main
} // end class

I'm just at a loss as to how to make it search for arbitrary pieces like that. He knows what's in the text file, so he knows that sequences like my examples above do occur in the text, but my program does not find them.

David Conrad · Accepted Answer

Don't use hasNext() and next() since those will only return a single token at a time from the input file, and you won't be able to find a multi-word phrase (or anything containing spaces). If you use hasNextLine() and nextLine() you can do a little better, but it still won't find cases where "of my" appears with "of" as the last word on one line, and "my" as the first word on the next line. To find that, you need a little more context.
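
To see why token-based scanning cannot match a fragment that contains a space, here is a minimal sketch. The class and method names (TokenDemo, tokens, anyTokenContains) are made up for illustration, and a Scanner over an in-memory String stands in for the file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class TokenDemo
{
    // Collect what Scanner.next() returns with the default (whitespace) delimiter.
    static List<String> tokens(String input)
    {
        List<String> result = new ArrayList<>();
        try (Scanner scanner = new Scanner(input))
        {
            while (scanner.hasNext())
            {
                result.add(scanner.next());
            }
        }
        return result;
    }

    // True if any single token contains the fragment; this is the test the original loop performs.
    static boolean anyTokenContains(String input, String fragment)
    {
        for (String token : tokens(input))
        {
            if (token.contains(fragment))
            {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args)
    {
        String sample = "the end of my story";
        System.out.println(tokens(sample));                    // [the, end, of, my, story]
        System.out.println(anyTokenContains(sample, "stor"));  // true
        System.out.println(anyTokenContains(sample, "of my")); // false: no token contains a space
    }
}
```

Because the delimiter is stripped before each token is returned, no token can ever contain "of my", even though the text itself does.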

If you keep track of the last line read from the file, you can create a two-line buffer and find instances that are spread across multiple lines.

String last = ""; // initially, last is empty

while (scanner.hasNextLine())
{

    String line = scanner.nextLine();
    String text = last + " " + line; // two-line buffer

    if (text.contains(findWord))
    {
        count++;
    }

    last = line; // remember the last line read

} //end search loop

This should find words broken across two lines, but there are still three problems. First, you could have a phrase like "three lines long" that is broken across three lines:

  three
  lines
  long

You would need to extend the two-line buffer concept to find this. Ultimately, you might need to have the entire file in memory at once, but I suspect that is enough of an edge case that you probably don't care about it.

Second, when a match lies entirely within a single line, you will count it twice: once when that line is the current line being read, and again on the next iteration, when the same line is in last.

Third, using contains in this way won't find multiple copies of the same word on the same line. So if you are looking for "dog" and the following text appears:

  My dog saw a dog today at the dog park which was full of dogs.

The test with contains will only cause count to be incremented once. (But it would happen again when this line was in last.)
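
The second and third problems can be reproduced with a small sketch that runs the same two-line buffer loop over an in-memory string instead of a file. The class and method names (TwoLineBufferDemo, twoLineBufferCount) are hypothetical, chosen only for this illustration.

```java
import java.util.Scanner;

class TwoLineBufferDemo
{
    // Same logic as the two-line buffer loop, applied to text held in a String.
    static int twoLineBufferCount(String fileText, String findWord)
    {
        int count = 0;
        String last = ""; // initially, last is empty
        try (Scanner scanner = new Scanner(fileText))
        {
            while (scanner.hasNextLine())
            {
                String line = scanner.nextLine();
                String text = last + " " + line; // two-line buffer
                if (text.contains(findWord))
                {
                    count++;
                }
                last = line; // remember the last line read
            }
        }
        return count;
    }

    public static void main(String[] args)
    {
        // Phrase split across two lines: found once, as intended.
        System.out.println(twoLineBufferCount("end of\nmy story", "of my")); // 1

        // One real occurrence on the first line, but it is seen again as part of last.
        System.out.println(twoLineBufferCount("a dog here\nnothing else", "dog")); // 2

        // Four real occurrences on one line, but contains only reports yes or no.
        System.out.println(twoLineBufferCount(
                "My dog saw a dog today at the dog park which was full of dogs.", "dog")); // 1
    }
}
```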

So I think you really need to (1) read the entire file into a buffer, so that phrases split across any number of lines can be found, and (2) search through that buffer using indexOf with an offset that advances past each match until no more matches are found.

StringBuilder text = new StringBuilder(); // avoids re-copying the whole text on every +=

if (scanner.hasNextLine())
{
    text.append(scanner.nextLine()); // first line
}
while (scanner.hasNextLine())
{
    text.append(' '); // separate lines with a space
    text.append(scanner.nextLine());
}

int found, offset = 0; // start looking at the beginning, offset 0
while ((found = text.indexOf(findWord.toString(), offset)) != -1) // indexOf needs a String; findWord is a CharSequence in the question
{
    count++; // found a match
    offset = found + 1; // look for next match after this match
}
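
For reference, the same idea as a self-contained sketch. The names (WholeFileCounter, countOccurrences, countInLines, countInFile) are hypothetical, and countInFile assumes the file is UTF-8, since that is what Files.readAllLines(Path) decodes by default. The guard for an empty fragment is an addition: without it the indexOf loop would never terminate.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

class WholeFileCounter
{
    // Count every occurrence of fragment in text, including overlapping ones.
    static int countOccurrences(String text, String fragment)
    {
        if (fragment.isEmpty())
        {
            return 0; // an empty fragment would otherwise match at every offset forever
        }
        int count = 0;
        int offset = 0;
        int found;
        while ((found = text.indexOf(fragment, offset)) != -1)
        {
            count++;            // found a match
            offset = found + 1; // resume one character past the start of this match
        }
        return count;
    }

    // Join the lines with single spaces so phrases split across lines can match.
    static int countInLines(List<String> lines, String fragment)
    {
        return countOccurrences(String.join(" ", lines), fragment);
    }

    // Read the whole file into memory, then search it (assumes UTF-8 content).
    static int countInFile(Path file, String fragment) throws IOException
    {
        return countInLines(Files.readAllLines(file), fragment);
    }
}
```

Advancing by one character rather than by the fragment's length means that searching "aaa" for "aa" reports 2; use offset = found + fragment.length() if overlapping matches should not be counted.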

If you don't care about matches broken across multiple lines, then you can do it one line at a time and avoid the memory cost of having the entire text in memory at once.
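
A line-at-a-time variant might look like the sketch below. The names (PerLineCounter, countPerLine) are hypothetical. Only the current line is held in memory, and, as noted, a phrase that straddles a line break is not found.

```java
import java.util.Scanner;

class PerLineCounter
{
    // Count occurrences of fragment within each line separately and sum them.
    static int countPerLine(Scanner scanner, String fragment)
    {
        if (fragment.isEmpty())
        {
            return 0; // avoid an endless loop on an empty search string
        }
        int count = 0;
        while (scanner.hasNextLine())
        {
            String line = scanner.nextLine();
            int offset = 0;
            int found;
            while ((found = line.indexOf(fragment, offset)) != -1)
            {
                count++;
                offset = found + 1; // keep looking further along the same line
            }
        }
        return count;
    }

    public static void main(String[] args)
    {
        String sample = "My dog saw a dog\nthe dog park\nend of\nmy dogs";
        System.out.println(countPerLine(new Scanner(sample), "dog"));   // 4
        System.out.println(countPerLine(new Scanner(sample), "of my")); // 0: split across lines
    }
}
```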
